By: Akash Chatake
Organization: Chatake Innoworks
Publisher: Chatake Innoworks Publications
Edition: First Edition, 2025
Series: Computer Technology & Engineering Series
Course Code: MSBTE 316316
“Bridging Theory and Practice in the Age of AI”
Target Audience: Computer Technology & Engineering Students
Academic Level: Diploma/Undergraduate
A complete educational resource covering machine learning fundamentals, algorithms, and real-world applications with hands-on Python implementations.
Publication Year: 2025
Location: India
Language: English
Pages: 500+
Code Examples: 100+ Working Examples
Exercises: 50+ Hands-on Problems
✅ Fully aligned with MSBTE Course Code 316316
✅ Complete coverage of all learning outcomes
✅ Industry-standard best practices included
✅ Practical lab exercises provided
This book represents the culmination of extensive research, practical experience, and educational expertise in the field of machine learning and artificial intelligence.
First Edition, 2025
Copyright © 2025 by Akash Chatake and Chatake Innoworks Organization
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law.
Chatake Innoworks Publications
Publications Division
Chatake Innoworks Organization
India
Publication Date: November 2025
Edition: First Edition
Print ISBN: 978-X-XXXX-XXXX-X (to be assigned)
Digital ISBN: 978-X-XXXX-XXXX-X (to be assigned)
This textbook is specifically designed for educational purposes and is aligned with: - MSBTE (Maharashtra State Board of Technical Education) - Course Code: 316316 - Machine Learning - Computer Technology & Engineering Programs
Educational institutions are granted limited permission to use this material for classroom instruction, provided proper attribution is maintained.
Language: English
Target Audience: Diploma and Undergraduate Students
Subject Classification: Computer Science, Machine Learning, Artificial Intelligence
Dewey Decimal: 006.31 (Machine Learning)
Library of Congress: QA76.87 (Machine Learning)
The information in this book is provided “as is” without warranty of any kind. The authors and publisher disclaim all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose.
While every effort has been made to ensure accuracy, the authors and publisher assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
© 2025 Chatake Innoworks. All rights reserved.
To my students, who inspire me daily with their curiosity and determination to understand the mysteries of artificial intelligence and machine learning.
To the educators around the world who dedicate their lives to making complex concepts accessible and igniting the spark of discovery in young minds.
To the open-source community, whose collaborative spirit and generous sharing of knowledge make resources like this possible, democratizing access to cutting-edge technology education.
To the pioneers of machine learning - from Alan Turing to Geoffrey Hinton, from Marvin Minsky to Yann LeCun - whose groundbreaking work laid the foundation for the AI revolution we witness today.
To my family, for their unwavering support and understanding during the countless hours spent crafting this educational resource.
To the Maharashtra State Board of Technical Education (MSBTE) and all educational institutions committed to preparing students for the digital future.
And finally, to every student who will use this book to embark on their journey into the fascinating world of machine learning and artificial intelligence. May this knowledge empower you to solve real-world problems, innovate responsibly, and contribute to building a better tomorrow.
“The best way to predict the future is to invent it.”
– Alan Kay
“Machine intelligence is the last invention that humanity will ever need to make.”
– Nick Bostrom
Remember: Every expert was once a beginner. Every pro was once an amateur. Every icon was once an unknown. The journey of a thousand miles begins with a single step.
Your journey into machine learning starts here.
Akash Chatake
November 2025

# ABOUT THE AUTHOR
Founder & Chief Technology Officer, Chatake Innoworks
Akash Chatake is a passionate technology educator, researcher, and innovator with extensive experience in artificial intelligence, machine learning, and educational technology. As the founder of Chatake Innoworks, he leads a dynamic organization focused on bridging the gap between cutting-edge technology and practical education.
With a deep commitment to making complex technical concepts accessible to students, Akash believes in the power of hands-on learning combined with solid theoretical foundations. His teaching methodology emphasizes practical applications, real-world problem-solving, and ethical considerations in technology development.
“To democratize access to high-quality machine learning education and empower the next generation of AI innovators through practical, ethical, and comprehensive learning resources.”
Currently leading initiatives at Chatake Innoworks to: - Develop next-generation educational AI tools - Create comprehensive learning pathways for emerging technologies - Foster ethical AI development practices among students and professionals - Bridge the industry-academia gap in technology education
“Every day, I’m amazed by the potential of artificial intelligence to solve humanity’s greatest challenges. Through education, we can ensure this potential is realized responsibly and beneficially for all. This book represents my contribution to that mission - to create informed, ethical, and capable AI practitioners who will shape our future.”
About Chatake Innoworks
Chatake Innoworks is a forward-thinking organization dedicated to innovation in technology education. Founded with the vision of making advanced technology concepts accessible to all learners, the organization develops cutting-edge educational resources, conducts research in educational technology, and provides consultancy services to academic institutions and industry partners.
Mission: To accelerate human potential through innovative technology education.
Vision: A world where everyone has access to world-class technology education, regardless of their background or location.
For speaking engagements, collaboration opportunities, or educational consultancy, please contact through the official channels listed above.

# PREFACE
In the rapidly evolving landscape of technology, few fields have captured the imagination and transformed industries as profoundly as machine learning and artificial intelligence. From the smartphones in our pockets to the recommendation systems that guide our daily choices, from autonomous vehicles navigating our streets to medical AI diagnosing diseases, machine learning has become the invisible force driving the modern world.
As educators and technologists, we recognized a critical gap in the educational resources available to students pursuing computer technology and engineering. While numerous advanced texts exist for researchers and PhD candidates, and countless online tutorials target hobby programmers, there was a distinct need for a comprehensive, academically rigorous yet accessible textbook specifically designed for diploma and undergraduate students following the MSBTE curriculum.
This book was born from that need - to provide a bridge between theoretical computer science and practical, industry-relevant machine learning skills.
We believe that effective learning happens when three elements converge: 1. Solid theoretical foundation - Understanding the ‘why’ behind algorithms 2. Hands-on practical experience - Implementing and experimenting with real code 3. Real-world context - Seeing how concepts apply to actual problems
Every chapter in this book is structured around these three pillars. You’ll find mathematical explanations that build intuition, Python code that you can run and modify, and case studies that demonstrate real applications.
Every learning outcome specified in Course Code 316316 is comprehensively covered. The content, sequence, and depth are carefully calibrated to support both classroom instruction and self-study for MSBTE students.
We start with fundamental concepts and gradually build complexity. Each chapter assumes mastery of previous chapters, creating a scaffolded learning experience that builds confidence and competence simultaneously.
While academically rigorous, this book emphasizes skills that are immediately applicable in industry. From data preprocessing pipelines to model deployment considerations, students will learn practices used in professional machine learning teams.
In an era where AI systems impact millions of lives, we’ve woven ethical considerations throughout the text. Students will learn not just how to build AI systems, but how to build them responsibly.
All examples use open-source tools, primarily Python and its rich ecosystem of machine learning libraries. Students can replicate every example without expensive software licenses.
By the end of this journey, you will be able to:
✅ Understand the fundamental principles underlying machine learning algorithms
✅ Implement algorithms from scratch and using professional libraries
✅ Evaluate model performance using appropriate metrics and validation techniques
✅ Apply feature engineering and data preprocessing techniques effectively
✅ Deploy machine learning solutions to real-world problems
✅ Communicate findings and recommendations to both technical and non-technical audiences
✅ Continue learning independently in this rapidly evolving field
This book would not have been possible without the contributions of many individuals and organizations.
Machine learning is not a destination but a journey. The field evolves rapidly, with new techniques, tools, and applications emerging continuously. This book provides you with the foundational knowledge and learning skills to adapt and grow with the field.
As you embark on this learning adventure, remember that every expert was once a beginner. Be patient with yourself, celebrate small victories, and never hesitate to ask questions. The machine learning community is remarkably welcoming and collaborative - you’re joining a global network of learners and innovators.
Writing this book has been a labor of love, combining decades of teaching experience with the latest advances in machine learning pedagogy. We’ve tried to anticipate your questions, provide multiple perspectives on difficult concepts, and create a learning experience that is both rigorous and enjoyable.
Your feedback is invaluable to us. As you work through the material, please share your experiences, questions, and suggestions. Education is a collaborative process, and this book will continue to evolve based on the needs of learners like you.
Welcome to the exciting world of machine learning. The future is waiting for the solutions you’ll create.
Happy Learning!
Akash Chatake
Founder, Chatake Innoworks
November 2025
“The beautiful thing about learning is that no one can take it
away from you.”
– B.B. King

# Machine Learning Textbook - Table of Contents
Based on MSBTE Syllabus - Course Code: 316316
This comprehensive textbook follows the O’Reilly style and covers all learning outcomes specified in the official syllabus. Each chapter integrates rigorous theoretical foundations with practical implementations, drawing from authoritative sources including Tom Mitchell’s “Machine Learning” and Russell & Norvig’s “Artificial Intelligence: A Modern Approach.” The text provides mathematical derivations, statistical theory, and information-theoretic foundations to ensure both academic rigor and industry applicability.
Learning Outcomes: CO1 - Explain the role of machine learning in AI and data science
Practical Labs: - Installation of IDE with necessary libraries - Basic Python ML script development - Exploring different ML types with examples
Learning Outcomes: CO2 - Implement data preprocessing
Practical Labs: - Data preprocessing pipeline implementation - Reading datasets (Text, CSV, JSON, XML) - Missing value handling techniques - Train-test split implementation
Learning Outcomes: CO3 - Implement feature engineering techniques
Practical Labs: - Feature importance identification programs - PCA implementation for dimensionality reduction - Feature selection pipeline development
Learning Outcomes: CO4 - Apply supervised learning models (Classification)
Practical Labs: - Decision Tree implementation on prepared datasets - KNN model with different K values and performance measurement - SVM model training on given datasets - Classification performance evaluation
Learning Outcomes: CO4 - Apply supervised learning models (Regression)
Practical Labs: - Linear regression implementation with suitable datasets - Logistic regression for binary classification - Ridge regression implementation and comparison - Comprehensive model evaluation pipeline
Learning Outcomes: CO5 - Apply unsupervised learning models
Practical Labs: - K-means clustering for pattern discovery - Customer segmentation using clustering algorithms - Visualization using Matplotlib/Seaborn - Hierarchical clustering implementation
Learning Outcomes: CO5 - Apply unsupervised learning models
Practical Labs: - PCA implementation retaining important information - Dimensionality reduction pipeline development - Visualization of high-dimensional data
Learning Outcomes: Integration of CO1-CO5
Practical Labs: - Complete ML pipeline on real datasets - Boston Housing Dataset analysis and prediction - Waiter’s tip prediction model - Stock market prediction implementation - Human scream detection for crime control
Learning Outcomes: Advanced CO4-CO5 applications
Learning Outcomes: Professional ML practices
Core Theoretical References: - “Machine Learning” by Tom Mitchell (foundational definitions and theory) - “Artificial Intelligence: A Modern Approach” by Russell & Norvig (AI context and reasoning) - “Pattern Recognition and Machine Learning” by Christopher Bishop (Bayesian methods) - “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman (statistical theory)
Practical Implementation Guides: - “Hands-On Machine Learning” by Aurélien Géron (practical Python implementations) - “Python Machine Learning” by Sebastian Raschka (Python-focused approach) - “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani (R-based)
This textbook is designed to provide comprehensive coverage of machine learning concepts while maintaining practical applicability and industry relevance. Each chapter integrates rigorous theoretical foundations with hands-on laboratory exercises, ensuring readers develop both conceptual understanding and practical skills essential for academic success and professional competency.

# Chapter 1: Introduction to Machine Learning

## Unit I: Introduction to Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”
– Tom Mitchell, Machine Learning (1997)
“Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.”
– Russell & Norvig, Artificial Intelligence: A Modern Approach
By the end of this chapter, you will be able to: - TLO 1.1: Describe machine learning concepts and terminology - TLO 1.2: Compare traditional programming vs ML-based programming approaches - TLO 1.3: Distinguish between supervised, unsupervised, and reinforcement learning - TLO 1.4: Explain the challenges and limitations of machine learning - TLO 1.5: Explain the features and applications of Python libraries used for machine learning
Tom Mitchell’s Formal Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
Let’s break this down with a concrete example: - Task (T): Classifying emails as spam or not spam - Performance Measure (P): Percentage of emails correctly classified - Experience (E): A database of emails labeled as spam or not spam
Russell & Norvig’s Perspective: Machine learning is fundamentally about inductive inference - drawing general conclusions from specific examples. It’s a form of automated reasoning that allows agents to improve their performance through experience.
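To make Mitchell's definition concrete, here is a minimal sketch in which performance P on a task T improves as experience E grows. The digits dataset and logistic regression model are illustrative choices, not prescribed by the syllabus:

```python
# Task T: classify handwritten digits; Performance P: test accuracy;
# Experience E: the number of labeled training examples seen.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

for n in [50, 200, 800, len(X_train)]:  # increasing experience E
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"Trained on {n:4d} examples -> accuracy {acc:.3f}")
```

Accuracy typically climbs as n grows, which is exactly Mitchell's criterion for learning.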
Machine learning has evolved from academic theory to the backbone of modern technology:
Historical Context: - 1950s: Alan Turing’s “Computing Machinery and Intelligence” - 1959: Arthur Samuel coins the term “machine learning” - 1980s-1990s: Expert systems and statistical methods - 2000s: Big data and computational power explosion - 2010s-Present: Deep learning and AI democratization
Modern Impact: From Netflix recommendations to autonomous vehicles, ML algorithms process over 2.5 quintillion bytes of data daily, making our digital lives more intuitive and efficient.
AI Hierarchy (Russell & Norvig Framework):
Artificial Intelligence
├── Machine Learning
│   ├── Supervised Learning
│   ├── Unsupervised Learning
│   └── Reinforcement Learning
└── Other AI Approaches
    ├── Expert Systems
    ├── Logic-based AI
    └── Search Algorithms
ML in Data Science Pipeline: 1. Data Collection → Raw data gathering 2. Data Processing → Cleaning and preparation 3. Exploratory Analysis → Pattern discovery 4. Machine Learning → Model building and prediction 5. Deployment → Production implementation 6. Monitoring → Performance tracking
In traditional programming, we follow a straightforward process:
Data + Program → Output
Consider writing a program to identify spam emails. Using traditional programming, you might create rules like:
def is_spam_traditional(email):
spam_indicators = 0
# Manual rules
if "free money" in email.lower():
spam_indicators += 1
if email.count("!") > 3:
spam_indicators += 1
if "urgent" in email.lower():
spam_indicators += 1
    return spam_indicators > 2

Problems with this approach: - Rules must be manually crafted - Difficult to handle edge cases and exceptions - Poor scalability as complexity increases - Requires domain expertise to create comprehensive rules - Maintenance becomes increasingly difficult over time
Theoretical Foundation: Traditional programming follows a deductive reasoning approach - we start with general rules and apply them to specific cases. This works well for well-defined problems but fails when: 1. The problem space is too complex to enumerate all rules 2. The environment is dynamic and constantly changing 3. We need to handle uncertainty and probabilistic outcomes
Machine learning inverts this paradigm:
Data + Output → Program (Model)
Inductive Learning Process (Russell & Norvig): 1. Observation: Collect examples (training data) 2. Hypothesis Formation: Generate potential patterns 3. Testing: Validate hypotheses against new data 4. Refinement: Adjust the model based on performance
# ML approach for spam detection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
def create_spam_classifier():
"""
ML-based spam classifier that learns patterns from data
"""
# Create a pipeline that:
# 1. Converts text to numerical features
# 2. Applies Naive Bayes classification
return Pipeline([
('vectorizer', CountVectorizer()),
('classifier', MultinomialNB())
])
# Training the model
spam_classifier = create_spam_classifier()
# Model learns patterns from labeled examples
spam_classifier.fit(emails_training_data, labels)
# Making predictions on new data
predictions = spam_classifier.predict(new_emails)

Key Advantages: - Automatic Pattern Discovery: No manual rule creation needed - Adaptation: Can improve with new data - Generalization: Handles previously unseen cases - Scalability: Performance improves with more data
Tom Mitchell’s Classification: Machine learning paradigms differ in the type of feedback available during training and the nature of the learning task.
Russell & Norvig’s Perspective: Different learning types correspond to different forms of inductive inference and knowledge representation.
Formal Definition (Mitchell): Given a training set of examples {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)} where each xᵢ is an input and yᵢ is the corresponding target output, find a hypothesis h: X → Y that accurately predicts y for new inputs x.
Theoretical Foundation: - Learning as Function Approximation: Find function f: X → Y - Statistical Learning Theory: Minimize expected risk over unknown distribution - PAC Learning: Probably Approximately Correct learning framework
Working Principle: - Training Phase: Algorithm analyzes input-output pairs to identify patterns - Hypothesis Formation: Creates internal model representing learned relationships - Prediction Phase: Applies learned model to new, unseen inputs - Evaluation: Performance measured against ground truth labels
Mathematical Formulation:
Minimize: E[(h(x) - y)²] [for regression]
Maximize: P(h(x) = y) [for classification]
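Both objectives can be estimated empirically from a finite sample. A tiny worked example (the arrays are arbitrary illustrative values):

```python
import numpy as np

# Regression objective: empirical estimate of E[(h(x) - y)^2]
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
mse = np.mean((y_pred - y_true) ** 2)
print(f"Empirical MSE: {mse:.3f}")  # (0.25 + 0.25 + 1.0) / 3 = 0.500

# Classification objective: empirical estimate of P(h(x) = y)
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
accuracy = np.mean(labels_pred == labels_true)
print(f"Empirical accuracy: {accuracy:.2f}")  # 4 of 5 correct = 0.80
```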
Key Characteristics: - Feedback Type: Direct supervision through correct answers - Learning Goal: Generalization from labeled examples to unseen data - Performance Measure: Accuracy, precision, recall, MSE, etc. - Data Requirement: Labeled training examples
Predicts discrete categories or classes.
Examples: - Email spam detection (spam/not spam) - Image recognition (cat/dog/bird) - Medical diagnosis (disease/healthy)
# Classification example: Iris flower species
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42
)
# Train classifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
# Make predictions
predictions = classifier.predict(X_test)
print(f"Accuracy: {classifier.score(X_test, y_test):.2f}")

Predicts continuous numerical values.
Examples: - House price prediction - Stock price forecasting - Temperature estimation
# Regression example: California housing prices
# Note: load_boston was removed from scikit-learn (1.2+) because of
# ethical concerns; the California housing dataset is used here instead.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load dataset
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
# Train regressor
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Make predictions
predictions = regressor.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse:.2f}")

Imagine trying to understand the structure of a library when all the books have been randomly scattered with no labels or categories. This is the challenge unsupervised learning tackles - finding meaningful patterns in data without any guidance about what the “right answer” should be.
Mathematical Framework: Given only input data {x₁, x₂, …, xₙ} without corresponding outputs, discover the underlying probability distribution P(x) or find meaningful structure in the data space.
Core Objective: Maximize likelihood P(X|θ) or minimize reconstruction error for discovered patterns.
Learning Process: - Pattern Discovery: Identify hidden structures, relationships, or clusters - Dimensionality Understanding: Reduce complexity while preserving important information - Density Estimation: Model the underlying data distribution - Feature Learning: Discover meaningful representations automatically
Groups similar data points together.
Examples: - Customer segmentation for marketing - Gene sequencing analysis - Market research and demographics
# Clustering example: Customer segmentation
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2,
random_state=42, cluster_std=1.0)
# Apply K-means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
cluster_labels = kmeans.fit_predict(X)
# Visualize results
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
plt.title('Customer Segmentation using K-Means')
plt.show()

Reduces the number of features while preserving important information.
Examples: - Data visualization - Noise reduction - Feature compression
# Dimensionality reduction example: PCA
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
# Load high-dimensional data
digits = load_digits()
print(f"Original dimensions: {digits.data.shape}")
# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
digits_2d = pca.fit_transform(digits.data)
print(f"Reduced dimensions: {digits_2d.shape}")
# Visualize
plt.scatter(digits_2d[:, 0], digits_2d[:, 1], c=digits.target, cmap='tab10')
plt.title('Handwritten Digits in 2D (PCA)')
plt.show()

Think of learning to ride a bicycle - you don’t have a teacher showing you labeled examples of “correct” and “incorrect” riding positions. Instead, you try different actions and learn from the consequences: staying balanced feels good (positive reward), while falling hurts (negative reward). This is exactly how reinforcement learning works.
Mathematical Foundation: An agent learns an optimal policy π* by maximizing expected cumulative reward:

π* = argmax_π E[Σₜ γᵗ rₜ | π]

Where γ is the discount factor and rₜ is the reward at time t.
The Learning Framework: - Agent: The decision-maker (e.g., game player, robot, trading algorithm) - Environment: The world that provides feedback (e.g., game rules, physical world, market) - State Space S: All possible situations the agent can encounter - Action Space A: All possible actions available to the agent - Reward Function R: Immediate feedback for state-action pairs - Policy π: The agent’s strategy for choosing actions given states
Examples: - Game playing (Chess, Go, video games) - Autonomous vehicles - Trading algorithms - Robot navigation
# Simple reinforcement learning example: Multi-armed bandit
import numpy as np
import matplotlib.pyplot as plt
class MultiArmedBandit:
def __init__(self, n_arms=3):
self.n_arms = n_arms
# True reward probabilities (unknown to agent)
self.true_rewards = np.random.rand(n_arms)
def pull_arm(self, arm):
# Return 1 with probability true_rewards[arm], else 0
return np.random.rand() < self.true_rewards[arm]
class EpsilonGreedyAgent:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.counts = np.zeros(n_arms)
self.values = np.zeros(n_arms)
def select_arm(self):
if np.random.rand() < self.epsilon:
return np.random.randint(self.n_arms) # Explore
else:
return np.argmax(self.values) # Exploit
def update(self, arm, reward):
self.counts[arm] += 1
# Running average
self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
# Simulation
bandit = MultiArmedBandit(n_arms=3)
agent = EpsilonGreedyAgent(n_arms=3, epsilon=0.1)
rewards = []
for _ in range(1000):
arm = agent.select_arm()
reward = bandit.pull_arm(arm)
agent.update(arm, reward)
rewards.append(reward)
print(f"True reward rates: {bandit.true_rewards}")
print(f"Learned values: {agent.values}")
print(f"Average reward: {np.mean(rewards):.3f}")

According to Tom Mitchell, machine learning faces fundamental challenges rooted in the bias-variance tradeoff and the no free lunch theorem. Russell & Norvig emphasize that these challenges stem from the inherent difficulty of inductive inference - making reliable generalizations from limited data.
Theoretical Framework (Mitchell’s Learning Theory): - Hypothesis Space (H): Set of all possible models - Version Space: Subset of hypotheses consistent with training data - Inductive Bias: Assumptions that guide hypothesis selection
Key Challenge: Finding the right balance between: - Generalization: Performance on unseen data - Specialization: Fitting the training data
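One way to see this balance is to fit models of increasing flexibility to the same noisy data. A minimal sketch, in which the sine target, noise level, and polynomial degrees are arbitrary assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy samples
x_grid = np.linspace(0, 1, 200)
y_truth = np.sin(2 * np.pi * x_grid)  # the true underlying function

for degree in [1, 3, 12]:
    coeffs = np.polyfit(x, y, degree)     # fit the training data
    pred = np.polyval(coeffs, x_grid)     # predict on a dense grid
    err = np.mean((pred - y_truth) ** 2)  # error against the truth
    print(f"Degree {degree:2d}: error vs true function = {err:.3f}")
```

A degree-1 fit is too general (it underfits the sine wave), a degree-12 fit too specialized (it chases the noise), while a moderate degree tends to balance the two.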
Theoretical Perspective: The PAC Learning Framework (Probably Approximately Correct) requires that training data be: - Representative: Drawn from the same distribution as test data - Sufficient: Enough samples for statistical significance - Clean: Free from systematic errors
Common Data Problems: - Missing Data: Incomplete feature vectors - Impact: Reduces effective sample size - Solutions: Imputation, deletion, or model-based approaches
# Example: Detecting and handling data quality issues
import pandas as pd
import numpy as np
def assess_data_quality(df):
"""
Comprehensive data quality assessment
"""
quality_report = {
'missing_values': df.isnull().sum(),
'duplicate_rows': df.duplicated().sum(),
'data_types': df.dtypes,
'outliers_detected': {},
'potential_bias': {}
}
# Detect outliers using IQR method
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
quality_report['outliers_detected'][col] = len(outliers)
    return quality_report

Theoretical Foundation (Mitchell’s Analysis): Total Error = Bias² + Variance + Noise
Bias: Error from overly simplistic assumptions - High bias → Underfitting - Algorithm consistently misses relevant patterns

Variance: Error from sensitivity to small fluctuations in training set - High variance → Overfitting - Algorithm memorizes training data noise
# Demonstration of bias-variance tradeoff
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
# Analyze bias-variance tradeoff with tree depth
param_range = range(1, 21)
train_scores, validation_scores = validation_curve(
DecisionTreeRegressor(random_state=42), X, y,
param_name='max_depth', param_range=param_range,
cv=5, scoring='neg_mean_squared_error'
)
# Visualize the tradeoff
plt.figure(figsize=(10, 6))
plt.plot(param_range, -train_scores.mean(axis=1), 'o-', label='Training Error')
plt.plot(param_range, -validation_scores.mean(axis=1), 'o-', label='Validation Error')
plt.xlabel('Tree Depth (Model Complexity)')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.title('Bias-Variance Tradeoff in Decision Trees')
plt.show()

Theoretical Background: As dimensionality increases, the volume of the space grows exponentially, making data increasingly sparse. This phenomenon, first described by Richard Bellman, poses significant challenges:
Mathematical Formulation: In d-dimensional space, the ratio of the volume of a hypersphere to its enclosing hypercube approaches 0 as d → ∞ (see the numeric sketch below).
Practical Implications: - Need exponentially more data to maintain density - Distance-based algorithms become ineffective - Visualization becomes impossible
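A short numeric check of this claim using only the standard library; the radius of 0.5 is that of the sphere inscribed in the unit hypercube:

```python
from math import pi, gamma

def sphere_to_cube_ratio(d, r=0.5):
    """Volume of a radius-r hypersphere divided by the unit hypercube."""
    return (pi ** (d / 2)) * (r ** d) / gamma(d / 2 + 1)

for d in [1, 2, 3, 5, 10, 20, 50]:
    print(f"d = {d:3d}: sphere/cube volume ratio = {sphere_to_cube_ratio(d):.2e}")
```

By d = 20 the ratio is already below 10⁻⁷: nearly all of the volume sits in the “corners” of the cube, far from any central neighborhood, which is why distance-based reasoning degrades in high dimensions.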
Solutions: - Feature Selection: Choose most relevant features - Dimensionality Reduction: PCA, t-SNE, UMAP - Regularization: Penalize model complexity
Theoretical Framework: ML algorithm complexity analysis using Big O notation:
Training Complexity: Time to build model - Linear models: O(n × d) - SVMs: O(n³) - Deep networks: O(epochs × n × parameters)
Prediction Complexity: Time to make predictions - Linear models: O(d) - k-NN: O(n × d) - Decision trees: O(log n)
# Time complexity demonstration
import time
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
def measure_training_time(algorithm, X, y):
"""Measure training time for different algorithms"""
start_time = time.time()
algorithm.fit(X, y)
end_time = time.time()
return end_time - start_time
# Compare algorithms on datasets of varying sizes
sample_sizes = [100, 500, 1000, 5000]
algorithms = {
'Linear Regression': LinearRegression(),
'SVM': SVR(),
'k-NN': KNeighborsRegressor()
}
for size in sample_sizes:
X, y = make_regression(n_samples=size, n_features=10, random_state=42)
print(f"\nDataset size: {size} samples")
for name, algo in algorithms.items():
training_time = measure_training_time(algo, X, y)
        print(f"{name}: {training_time:.4f} seconds")

Russell & Norvig Perspective: As AI systems become more complex, the need for transparent reasoning becomes critical for trust and adoption.
Levels of Interpretability: 1. Global Interpretability: Understanding entire model behavior 2. Local Interpretability: Understanding individual predictions 3. Counterfactual Explanations: “What if” scenarios
Trade-offs: - Simple models (linear regression, decision trees): High interpretability, potentially lower accuracy - Complex models (neural networks, ensembles): High accuracy, lower interpretability
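A brief sketch of what global interpretability looks like in code (the diabetes dataset is an illustrative choice): a linear model's coefficients can be read off directly, and a shallow tree's decision logic can be printed in full.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes()
X, y, names = data.data, data.target, data.feature_names

# Linear model: each coefficient is a global, human-readable effect size
linear = LinearRegression().fit(X, y)
for name, coef in zip(names, linear.coef_):
    print(f"{name:>6}: {coef:+8.1f}")

# Shallow tree: the entire decision logic fits on a few printed lines
tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)
print(export_text(tree, feature_names=list(names)))
```

A deep ensemble or neural network offers no such compact readout, which is the trade-off described above.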
Algorithmic Fairness: Ensuring ML systems don’t perpetuate or amplify societal biases
Types of Bias: - Historical Bias: Training data reflects past discrimination - Representation Bias: Underrepresentation of certain groups - Measurement Bias: Systematic errors in data collection
Fairness Definitions: - Statistical Parity: Equal positive prediction rates across groups - Equalized Odds: Equal true positive and false positive rates - Individual Fairness: Similar individuals receive similar outcomes
# Example: Measuring algorithmic bias
from sklearn.metrics import confusion_matrix
import pandas as pd
def measure_bias(y_true, y_pred, sensitive_attribute):
"""
Measure bias in predictions across sensitive groups
"""
results = {}
for group in sensitive_attribute.unique():
mask = sensitive_attribute == group
group_true = y_true[mask]
group_pred = y_pred[mask]
# Calculate group-specific metrics
tn, fp, fn, tp = confusion_matrix(group_true, group_pred).ravel()
results[group] = {
'accuracy': (tp + tn) / (tp + tn + fp + fn),
'true_positive_rate': tp / (tp + fn) if (tp + fn) > 0 else 0,
'false_positive_rate': fp / (fp + tn) if (fp + tn) > 0 else 0,
'positive_prediction_rate': (tp + fp) / len(group_pred)
}
return results
# Example usage would require actual data with sensitive attributes

When Guido van Rossum created Python in 1991, he probably didn’t imagine it would become the lingua franca of machine learning. Yet here we are - from Google’s TensorFlow to scikit-learn, the most powerful ML tools speak Python.
But why did Python win over languages like R, Java, or C++? The answer lies in what we call the “Goldilocks principle” - Python is not too complex like C++, not too domain-specific like R, but just right for the diverse needs of machine learning practitioners.
The Python Advantage in ML:
Simplicity Meets Power: Python’s syntax reads almost like natural language. Compare implementing a neural network in C++ versus Python - what takes hundreds of lines in C++ can be done in a dozen lines of Python.
Scientific Computing Foundation: Python wasn’t built for ML, but its scientific computing ecosystem was. Libraries like NumPy provide the mathematical backbone that makes ML computations feasible.
Rapid Prototyping: In machine learning, you spend more time experimenting than implementing. Python’s interactive nature and notebook environments (like Jupyter) make it perfect for the iterative process of model development.
Community and Ecosystem: When you face an ML problem, chances are someone has already solved it in Python. The extensive library ecosystem means you’re building on giant shoulders.
Foundation for scientific computing in Python.
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Mathematical operations
print(np.mean(arr)) # 3.0
print(np.std(arr)) # 1.41 (np.std uses the population formula by default)
print(matrix.dot(matrix)) # Matrix multiplication

Key Features: - Efficient array operations - Mathematical functions - Linear algebra operations - Random number generation
Powerful data structures and analysis tools.
import pandas as pd
# Create DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Data operations
print(df.describe()) # Statistical summary
print(df.groupby('Age')['Salary'].mean()) # Group operations
df_filtered = df[df['Age'] > 25] # Filtering

Key Features: - DataFrame and Series data structures - Data cleaning and transformation - File I/O (CSV, Excel, JSON, etc.) - Grouping and aggregation operations
Comprehensive plotting library.
import matplotlib.pyplot as plt
# Simple plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.figure(figsize=(8, 6))
plt.plot(x, y, marker='o')
plt.title('Simple Line Plot')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.grid(True)
plt.show()

Key Features: - Line plots, scatter plots, histograms - Subplots and multi-panel figures - Customizable styling - Export to various formats
Comprehensive ML library with consistent API.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Typical ML workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

Key Features: - Classification and regression algorithms - Model selection and evaluation tools - Preprocessing utilities - Consistent API across algorithms
# Download and install Anaconda from https://anaconda.com
# Create ML environment
conda create -n ml_env python=3.9
conda activate ml_env
conda install numpy pandas matplotlib scikit-learn jupyter

# Create virtual environment
python -m venv ml_env
source ml_env/bin/activate # On Windows: ml_env\Scripts\activate
# Install packages
pip install numpy pandas matplotlib scikit-learn jupyter notebook

Let’s create a complete machine learning pipeline:
# complete_ml_example.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import seaborn as sns
# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
def main():
print("🤖 Welcome to Machine Learning with Python!")
print("="*50)
# 1. Load Data
print("\n📊 Loading Iris dataset...")
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
# 2. Create DataFrame for easy manipulation
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
print("\n📋 Dataset overview:")
print(df.head())
print(f"\nDataset info:")
print(df.describe())
# 3. Visualize Data
plt.figure(figsize=(12, 8))
# Pairplot
plt.subplot(2, 2, 1)
for i, species in enumerate(target_names):
mask = y == i
plt.scatter(X[mask, 0], X[mask, 1],
label=species, alpha=0.7)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Sepal Dimensions')
plt.legend()
# Feature distributions
plt.subplot(2, 2, 2)
df.boxplot(column=feature_names[0], by='species', ax=plt.gca())
plt.title('Sepal Length Distribution')
plt.tight_layout()
plt.show()
# 4. Split Data
print("\n🔄 Splitting data into train/test sets...")
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# 5. Train Model
print("\n🎯 Training Random Forest classifier...")
model = RandomForestClassifier(
n_estimators=100,
random_state=42,
max_depth=3
)
model.fit(X_train, y_train)
print("✅ Model training completed!")
# 6. Make Predictions
print("\n🔮 Making predictions...")
y_pred = model.predict(X_test)
# 7. Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
print(f"\n📊 Model Performance:")
print(f"Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred,
target_names=target_names))
# 8. Feature Importance
importance = model.feature_importances_
plt.figure(figsize=(10, 6))
indices = np.argsort(importance)[::-1]
plt.subplot(1, 2, 1)
plt.bar(range(len(importance)), importance[indices])
plt.xticks(range(len(importance)),
[feature_names[i] for i in indices],
rotation=45)
plt.title('Feature Importance')
# 9. Prediction Probabilities
probabilities = model.predict_proba(X_test)
plt.subplot(1, 2, 2)
for i in range(len(target_names)):
plt.hist(probabilities[:, i], alpha=0.7,
label=f'{target_names[i]} confidence')
plt.xlabel('Prediction Probability')
plt.ylabel('Count')
plt.title('Prediction Confidence Distribution')
plt.legend()
plt.tight_layout()
plt.show()
# 10. Make predictions on new data
print("\n🆕 Testing on new samples:")
new_samples = np.array([
[5.1, 3.5, 1.4, 0.2], # Should be Setosa
[6.0, 2.7, 5.1, 1.9], # Should be Virginica
])
new_predictions = model.predict(new_samples)
new_probabilities = model.predict_proba(new_samples)
for i, (sample, pred, probs) in enumerate(zip(new_samples, new_predictions, new_probabilities)):
print(f"\nSample {i+1}: {sample}")
print(f"Predicted class: {target_names[pred]}")
print(f"Probabilities: {dict(zip(target_names, probs))}")
print("\n🎉 Machine Learning pipeline completed successfully!")
print("This example demonstrated:")
print("✅ Data loading and exploration")
print("✅ Data visualization")
print("✅ Model training")
print("✅ Performance evaluation")
print("✅ Feature importance analysis")
print("✅ Making predictions on new data")
if __name__ == "__main__":
    main()

Machine Learning vs Traditional Programming: ML learns patterns from data rather than following explicit rules.
Three Types of ML: supervised learning (learning from labeled examples), unsupervised learning (discovering structure in unlabeled data), and reinforcement learning (learning from rewards and penalties).
Real-world Applications: ML is transforming industries from healthcare to finance to transportation.
Python Ecosystem: NumPy, Pandas, Matplotlib, and Scikit-learn form the foundation of ML in Python.
Challenges Exist: Data quality, overfitting, interpretability, and ethical considerations are ongoing challenges.
In the next chapter, we’ll dive deep into data preprocessing - the crucial first step in any machine learning project. You’ll learn how to clean messy data, handle missing values, and prepare your datasets for training robust ML models.
Create examples for each type of machine learning using different datasets: 1. Supervised Classification: Use the wine dataset to classify wine types 2. Supervised Regression: Use the California housing dataset to predict prices 3. Unsupervised Clustering: Apply K-means to customer data 4. Dimensionality Reduction: Use PCA on high-dimensional data
Compare the traditional rule-based approach vs. ML approach for: 1. Temperature conversion (Celsius to Fahrenheit) 2. Email spam detection 3. Image recognition
Discuss when each approach is more appropriate.
Research and document three specific ML applications in: 1. Your field of study or interest 2. A local business or organization 3. A global challenge (climate change, healthcare, poverty, etc.)
For each application, identify: - Type of ML problem (classification, regression, clustering, etc.) - Input data and features - Expected output - Success metrics - Potential challenges
This chapter has introduced you to the exciting world of machine learning. The journey ahead will equip you with practical skills to solve real-world problems using data and algorithms. Let’s continue building your ML expertise!

# Chapter 2: Data Preprocessing
“Data preprocessing is like preparing ingredients before cooking - no matter how skilled the chef, poor ingredients lead to poor results.”
– Anonymous Data Scientist
By the end of this chapter, you’ll master: - Essential data cleaning techniques - Methods for handling missing values strategically - Dataset splitting strategies for robust model evaluation - Data quality assessment and validation - Best practices for preprocessing pipelines
Before any machine learning algorithm can extract meaningful patterns, the data must satisfy certain mathematical and statistical assumptions. In practice, raw data rarely meets these requirements, making preprocessing not just helpful, but mathematically necessary.
Consider a learning algorithm trying to minimize a loss function L(θ) over a dataset D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}. If our data contains significant noise ε, missing values, or inconsistent scaling, the optimization process becomes:
L’(θ) = L(θ) + N(ε) + M(missing) + S(scale)
Where N(ε) represents noise interference, M(missing) accounts for missing value bias, and S(scale) reflects scaling inconsistencies. Each term can dramatically alter the loss landscape, leading algorithms toward suboptimal solutions.
This mathematical reality explains why preprocessing isn’t just practical housekeeping - it’s about ensuring our algorithms can find the true underlying patterns rather than fitting to artifacts in the data. Think of it as preparing the mathematical foundation upon which learning can occur.
The famous “garbage in, garbage out” principle has deep mathematical roots: if our input data violates the assumptions of our chosen algorithm (independence, normality, homoscedasticity, etc.), even sophisticated models will produce unreliable results.
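A small simulation of the noise term N(ε) makes this concrete (the true weights and noise level are arbitrary assumptions): the same learner, on the same inputs, lands on a different solution when the labels are corrupted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_w = np.array([2.0, -1.0, 0.5])          # the true data-generating weights
y_clean = X @ true_w
y_noisy = y_clean + rng.normal(0, 2.0, 300)  # labels corrupted by noise ε

X_test = rng.normal(size=(1000, 3))
y_test = X_test @ true_w

for name, y_train in [("clean labels", y_clean), ("noisy labels", y_noisy)]:
    model = LinearRegression().fit(X, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.4f}, weights = {model.coef_.round(2)}")
```

With clean labels the true weights are recovered almost exactly; with noisy labels the optimizer settles on perturbed weights and a nonzero test error - exactly the shift in the loss landscape described above.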
Real-world datasets systematically violate the clean, well-structured assumptions underlying most machine learning theory. Understanding these violations from a statistical perspective helps us choose appropriate preprocessing strategies.
The Data Quality Spectrum
From a theoretical standpoint, data quality issues can be categorized by their impact on statistical learning:
Consider how each type affects the fundamental learning equation:
P(hypothesis|data) ∝ P(data|hypothesis) × P(hypothesis)
When data quality issues are present, we’re actually working with:
P(hypothesis|corrupted_data) ∝ P(corrupted_data|hypothesis) × P(hypothesis)
The corrupted likelihood P(corrupted_data|hypothesis) can lead to entirely different posterior beliefs about which hypothesis best explains our observations.
Real-world datasets come with numerous challenges:
# Example of messy real-world data
messy_data = {
'customer_name': ['John Doe', 'jane smith', 'BOB JOHNSON', None, ''],
'age': [25, 'thirty', 150, -5, None],
'income': ['$50,000', '75000', '$invalid$', '', '60K'],
'email': ['john@email.com', 'invalid-email', 'jane@.com', None, 'bob@email.com'],
'signup_date': ['2023-01-15', '01/15/2023', 'invalid', '2023-13-45', None]
}

Common Data Quality Issues: - Missing values: Empty cells, null values, placeholder text - Inconsistent formatting: Different date formats, mixed case text - Outliers: Unrealistic values (age = 150, negative income) - Duplicates: Identical or near-identical records - Invalid data: Malformed emails, impossible dates - Mixed data types: Numbers stored as text, inconsistent units
Data cleaning is fundamentally about identifying and correcting deviations from the true underlying data generating process. From a statistical perspective, we can formalize data cleaning as the process of transforming observed data X_obs to recover the true data X_true.
Mathematical Framework of Data Quality
Let’s define the relationship between observed and true data:
X_obs = X_true + ε_noise + ε_systematic + ε_missing
Where: - ε_noise represents random measurement errors - ε_systematic represents consistent biases or distortions - ε_missing represents information loss due to missing values
The goal of data cleaning is to estimate and minimize these error components, thereby recovering an approximation of X_true that preserves the essential statistical properties needed for learning.
Data Quality Metrics
We can quantify data quality using several statistical measures, for example completeness (the fraction of non-missing cells), uniqueness (the fraction of non-duplicate rows), and validity (the fraction of values satisfying domain rules); a small sketch computing such ratios follows.
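The following sketch computes these ratios with pandas; the sample data and the age range used for the validity rule are illustrative assumptions:

```python
import pandas as pd

def data_quality_metrics(df):
    """Simple data quality ratios, each in [0, 1] (1.0 = perfect)."""
    total_cells = df.shape[0] * df.shape[1]
    return {
        # Completeness: fraction of cells that are not missing
        'completeness': 1 - df.isnull().sum().sum() / total_cells,
        # Uniqueness: fraction of rows that are not duplicates
        'uniqueness': 1 - df.duplicated().sum() / len(df),
    }

sample = pd.DataFrame({
    'age': [25, 30, None, 25, 200],
    'income': [50000, None, 75000, 50000, 60000],
})
print(data_quality_metrics(sample))

# Validity requires a domain rule, e.g. ages must lie in a plausible range
valid_age = sample['age'].between(0, 120).sum() / sample['age'].notna().sum()
print(f"Age validity: {valid_age:.2f}")  # the value 200 fails the rule
```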
Data cleaning involves systematically identifying and correcting deviations from these quality standards.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def assess_data_quality(df):
"""Comprehensive data quality assessment"""
print("📊 DATA QUALITY ASSESSMENT")
print("=" * 50)
# Basic info
print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage().sum() / 1024:.2f} KB")
# Missing values
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
print(f"\n🔍 Missing Values:")
for col in df.columns:
if missing[col] > 0:
print(f" {col}: {missing[col]} ({missing_percent[col]:.1f}%)")
# Data types
print(f"\n📋 Data Types:")
for col in df.columns:
print(f" {col}: {df[col].dtype}")
# Potential duplicates
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate rows: {duplicates}")
# Unique values (for categorical columns)
print(f"\n🏷️ Unique Values (categorical columns):")
for col in df.select_dtypes(include=['object']).columns:
unique_count = df[col].nunique()
print(f" {col}: {unique_count} unique values")
if unique_count <= 10: # Show values if few
print(f" Values: {list(df[col].dropna().unique())}")
# Example usage
sample_data = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Alice', None],
'age': [25, 30, None, 25, 35],
'salary': [50000, 60000, 75000, 50000, None],
'department': ['Engineering', 'Sales', 'Engineering', 'Engineering', 'Marketing']
})
assess_data_quality(sample_data)

def clean_text_data(series):
"""Clean and standardize text data"""
return (series
.str.strip() # Remove leading/trailing spaces
.str.lower() # Convert to lowercase
.str.replace(r'[^\w\s]', '', regex=True) # Remove special characters
.str.replace(r'\s+', ' ', regex=True) # Normalize whitespace
)
def standardize_phone_numbers(series):
"""Standardize phone number format"""
return (series
.str.replace(r'[^\d]', '', regex=True) # Keep only digits
.str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)
)
def parse_currency(series):
    """Convert currency strings to numeric values"""
    # Note: the K suffix must be expanded before (not stripped with) the
    # other symbols, otherwise '60K' would become 60 instead of 60000
    return (series
        .str.replace(r'[\$,]', '', regex=True) # Remove $ and commas
        .str.replace('K', '000') # Convert K suffix to thousands
        .astype(float)
    )
# Example
messy_text = pd.Series([' JOHN DOE ', 'jane-smith!!!', 'Bob Johnson'])
clean_text = clean_text_data(messy_text)
print("Original:", messy_text.tolist())
print("Cleaned:", clean_text.tolist())

def handle_duplicates(df, subset=None, keep='first'):
"""Identify and handle duplicate records"""
# Find duplicates
duplicates = df.duplicated(subset=subset, keep=False)
print(f"📋 Duplicate Analysis:")
print(f"Total duplicates: {duplicates.sum()}")
if duplicates.sum() > 0:
print(f"Duplicate records:")
print(df[duplicates].sort_values(by=df.columns.tolist()))
# Remove duplicates
df_clean = df.drop_duplicates(subset=subset, keep=keep)
print(f"Rows before: {len(df)}")
print(f"Rows after: {len(df_clean)}")
return df_clean
    return df

def detect_outliers(series, method='iqr', threshold=1.5):
"""Detect outliers using different methods"""
if method == 'iqr':
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
outliers = (series < lower_bound) | (series > upper_bound)
elif method == 'zscore':
z_scores = np.abs((series - series.mean()) / series.std())
outliers = z_scores > threshold
elif method == 'percentile':
lower_bound = series.quantile(0.01)
upper_bound = series.quantile(0.99)
outliers = (series < lower_bound) | (series > upper_bound)
return outliers
def visualize_outliers(df, column):
"""Visualize outliers in a column"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Box plot
axes[0].boxplot(df[column].dropna())
axes[0].set_title(f'{column} - Box Plot')
axes[0].set_ylabel('Value')
# Histogram
axes[1].hist(df[column].dropna(), bins=30, alpha=0.7)
axes[1].set_title(f'{column} - Histogram')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
# Scatter plot with outliers highlighted
outliers = detect_outliers(df[column].dropna())
axes[2].scatter(range(len(df[column].dropna())),
df[column].dropna(),
c=['red' if x else 'blue' for x in outliers],
alpha=0.6)
axes[2].set_title(f'{column} - Outliers (red)')
axes[2].set_xlabel('Index')
axes[2].set_ylabel('Value')
plt.tight_layout()
plt.show()
# Example usage
np.random.seed(42)
data_with_outliers = np.concatenate([
np.random.normal(50, 10, 100), # Normal data
[150, 200, -50] # Outliers
])
outliers = detect_outliers(pd.Series(data_with_outliers))
print(f"Detected {outliers.sum()} outliers")

Missing data poses one of the most significant challenges to valid statistical inference. The mechanism that causes data to be missing determines both the appropriate handling strategy and the validity of our conclusions.
Rubin’s Missing Data Theory
Donald Rubin’s seminal work established the mathematical framework for understanding missing data mechanisms. Let R be a missing data indicator matrix where R_ij = 1 if X_ij is observed and R_ij = 0 if X_ij is missing.
The missing data mechanism is characterized by the conditional probability:
P(R | X_obs, X_miss, ψ)
Where ψ represents parameters governing the missing data process.
1. Missing Completely at Random (MCAR) - Formal Definition: P(R | X_obs, X_miss, ψ) = P(R | ψ) - Mathematical Implication: Missing values are statistically independent of all data values - Key Property: The observed data is a random subsample of the complete data - Statistical Consequence: Complete case analysis yields unbiased estimates (though with reduced power) - Example: Laboratory equipment randomly failing during data collection
2. Missing at Random (MAR) - Formal Definition: P(R | X_obs, X_miss, ψ) = P(R | X_obs, ψ) - Mathematical Implication: Missingness depends only on observed data, not on missing values - Key Property: Given the observed data, the missing data mechanism is ignorable - Statistical Consequence: Maximum likelihood and Bayesian inference remain valid under MAR - Example: Survey non-response that varies systematically by age or education level

3. Missing Not at Random (MNAR) - Formal Definition: P(R | X_obs, X_miss, ψ) depends on X_miss - Mathematical Implication: Missingness depends on the unobserved values themselves - Key Property: The missing data mechanism is non-ignorable - Statistical Consequence: Requires explicit modeling of the missing data mechanism - Example: High earners systematically refusing to disclose income information
Statistical Testing for Missing Data Mechanisms
We can test MCAR using Little’s test, which examines whether missing data patterns are consistent with MCAR:
H₀: Data are MCAR. Test Statistic: Follows a χ² distribution under the null hypothesis.
Understanding the missing data mechanism is crucial because different mechanisms require fundamentally different analytical approaches.
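Little's test itself is not implemented in pandas or scikit-learn, but a crude diagnostic in the same spirit is to ask whether the distribution of an observed variable differs between rows where another variable is missing and rows where it is present. A sketch using a simulated MAR mechanism (the data-generating process and missingness rule are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
n = 1000
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)

# MAR mechanism: income is missing far more often for younger respondents
p_missing = np.where(age < 35, 0.5, 0.05)
income_obs = np.where(rng.random(n) < p_missing, np.nan, income)
df = pd.DataFrame({'age': age, 'income': income_obs})

# Does age differ between rows with and without observed income?
miss = df['income'].isnull()
t_stat, p_value = stats.ttest_ind(df.loc[miss, 'age'], df.loc[~miss, 'age'])
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
# A tiny p-value indicates missingness depends on age, so MCAR is implausible
```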
def analyze_missing_values(df):
"""Comprehensive missing value analysis"""
# Calculate missing statistics
missing_stats = pd.DataFrame({
'Column': df.columns,
'Missing_Count': df.isnull().sum(),
'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,
'Data_Type': df.dtypes
})
missing_stats = missing_stats[missing_stats['Missing_Count'] > 0].sort_values(
'Missing_Percentage', ascending=False
)
print("🔍 Missing Value Analysis:")
print(missing_stats.to_string(index=False))
# Visualize missing patterns
if not missing_stats.empty:
plt.figure(figsize=(12, 6))
# Missing value heatmap
plt.subplot(1, 2, 1)
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.title('Missing Value Pattern')
# Missing value bar chart
plt.subplot(1, 2, 2)
missing_stats.plot(x='Column', y='Missing_Percentage', kind='bar', ax=plt.gca())
plt.title('Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Missing Percentage (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
return missing_stats
# Example dataset with missing values
sample_data = pd.DataFrame({
'age': [25, None, 35, 40, None, 30],
'income': [50000, 60000, None, 80000, 55000, None],
'education': ['Bachelor', 'Master', None, 'PhD', 'Bachelor', 'Master'],
'experience': [2, 5, None, 15, 3, 7]
})
missing_analysis = analyze_missing_values(sample_data)
def remove_missing_data(df, strategy='rows', threshold=0.5):
"""Remove missing data using different strategies"""
if strategy == 'rows':
# Remove rows with any missing values
df_clean = df.dropna()
print(f"Removed {len(df) - len(df_clean)} rows with missing values")
elif strategy == 'columns':
# Remove columns with missing values above threshold
missing_percent = df.isnull().sum() / len(df)
cols_to_drop = missing_percent[missing_percent > threshold].index
df_clean = df.drop(columns=cols_to_drop)
print(f"Removed columns: {list(cols_to_drop)}")
elif strategy == 'selective':
# Remove rows only if multiple values are missing
df_clean = df.dropna(thresh=len(df.columns) - 1) # Keep if at most 1 missing
print(f"Removed {len(df) - len(df_clean)} rows with multiple missing values")
return df_clean
# Example
print("Original shape:", sample_data.shape)
cleaned_data = remove_missing_data(sample_data, strategy='selective')
print("After cleaning:", cleaned_data.shape)from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
def impute_missing_values(df, strategy='mean', n_neighbors=5):
"""Impute missing values using various strategies"""
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns
df_imputed = df.copy()
# Handle numeric columns
if len(numeric_cols) > 0:
if strategy in ['mean', 'median', 'constant']:
imputer = SimpleImputer(strategy=strategy, fill_value=0 if strategy=='constant' else None)
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])
elif strategy == 'knn':
imputer = KNNImputer(n_neighbors=n_neighbors)
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])
elif strategy == 'iterative':
imputer = IterativeImputer(random_state=42)
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])
# Handle categorical columns (mode imputation)
if len(categorical_cols) > 0:
cat_imputer = SimpleImputer(strategy='most_frequent')
df_imputed[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])
return df_imputed
# Compare different imputation strategies
strategies = ['mean', 'median', 'knn']
imputation_results = {}
for strategy in strategies:
imputed_data = impute_missing_values(sample_data, strategy=strategy)
imputation_results[strategy] = imputed_data
print(f"\n{strategy.upper()} Imputation Results:")
print(imputed_data[['age', 'income']].describe())
def custom_imputation(df):
"""Custom imputation based on business logic"""
df_custom = df.copy()
# Example: Impute income based on education level
education_income_map = {
'Bachelor': df.groupby('education')['income'].mean()['Bachelor'],
'Master': df.groupby('education')['income'].mean()['Master'],
'PhD': df.groupby('education')['income'].mean()['PhD']
}
# Fill missing income based on education
for idx, row in df_custom.iterrows():
if pd.isna(row['income']) and pd.notna(row['education']):
df_custom.at[idx, 'income'] = education_income_map.get(row['education'], df['income'].mean())
return df_custom
def forward_fill_imputation(df, time_column):
"""Forward fill for time series data"""
df_sorted = df.sort_values(time_column)
df_filled = df_sorted.ffill() # Forward fill (fillna(method='ffill') is deprecated in recent pandas)
return df_filled
# Time series example
time_series_data = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=10, freq='D'),
'temperature': [20, None, 22, None, None, 25, 24, None, 23, 22],
'humidity': [60, 65, None, 70, 68, None, 72, 71, None, 69]
})
filled_ts = forward_fill_imputation(time_series_data, 'date')
print("Time series with forward fill:")
print(filled_ts)
Dataset splitting addresses one of the fundamental challenges in statistical learning: estimating how well our model will perform on future, unseen data. This is formalized through the bias-variance decomposition of generalization error.
Mathematical Foundation of Generalization
Consider a learning algorithm that produces hypothesis h based on training data D_train. The true risk (generalization error) is:
R(h) = E_{(x,y)~P}[L(h(x), y)]
Since we can’t compute this directly, we estimate it using test data D_test:
R̂(h) = (1/|D_test|) Σ_{(x,y)∈D_test} L(h(x), y)
The key insight is that R̂(h) is an unbiased estimator of R(h) only if D_test is drawn from the same distribution as future data and is independent of the training process.
The Fundamental Trade-off in Data Splitting
When splitting dataset D into training D_train and testing D_test portions, we face a fundamental trade-off:
Mathematically: if |D| = n and |D_train| = k, then: - Training error bias ∝ 1/k (decreases as the training set grows) - Test error variance ∝ 1/(n-k) (decreases as the test set grows)
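A small experiment makes this trade-off concrete. The sketch below (synthetic data; the split sizes, seed counts, and model are arbitrary illustrative choices) repeats random splits and shows that smaller test sets produce noisier accuracy estimates:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

for test_size in [0.1, 0.3, 0.5]:
    scores = []
    for seed in range(30):  # 30 repeated random splits per test-set size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    print(f"test_size={test_size}: mean={np.mean(scores):.3f}, "
          f"std={np.std(scores):.3f}")  # std shrinks as the test set grows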
Statistical Properties of Different Split Ratios
The optimal split ratio depends on the bias-variance characteristics of our learning algorithm:
Cross-Validation: A Statistical Sampling Perspective
Cross-validation provides a more sophisticated approach by treating dataset splitting as a statistical sampling problem. K-fold CV estimates generalization error as:
**CV_k = (1/k) Σ_{i=1}^k R̂_i(h_i)**
Where each R̂_i is computed on the i-th fold, and h_i is trained on the remaining k-1 folds.
The statistical properties of this estimator are well-understood: - Bias: Generally lower than single train-test split - Variance: Depends on k (higher k → lower bias, higher variance) - Computational Cost: k times higher than single split
Dataset splitting is fundamentally about creating reliable statistical estimates of future performance.
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
def basic_train_test_split(X, y, test_size=0.2, random_state=42):
"""Basic train-test split with explanation"""
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, random_state=random_state
)
print(f"📊 Dataset Split Information:")
print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
return X_train, X_test, y_train, y_test
# Example with Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = basic_train_test_split(X, y)
# Check class distribution
print(f"\n🏷️ Class Distribution:")
print(f"Training: {np.bincount(y_train)}")
print(f"Test: {np.bincount(y_test)}")Stratified sampling addresses a critical statistical issue: ensuring that our training and test distributions remain representative of the population distribution, particularly when dealing with imbalanced classes or important subgroups.
Mathematical Motivation
Consider a dataset with class proportions p₁, p₂, …, p_k. Under simple random sampling, the probability that our test set has dramatically different class proportions follows a multinomial distribution. For small datasets or rare classes, this can lead to test sets that under-represent, or entirely miss, rare classes, producing unstable and misleading performance estimates.
Stratified Sampling Algorithm
Stratified sampling ensures that each stratum (class) is represented proportionally:
For each class i: - n_i^train = ⌊n_i × (1 - test_ratio)⌋ - n_i^test = n_i - n_i^train
This guarantees that **P̂_test(class = i) ≈ P_population(class = i)** for all classes.
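The allocation formula above is easy to verify by hand; a minimal sketch (the class counts are invented for illustration):

import numpy as np

# Per-class allocation for an 80/20 stratified split
class_counts = {'A': 700, 'B': 200, 'C': 100}  # hypothetical n_i per class
test_ratio = 0.2
for cls, n_i in class_counts.items():
    n_train = int(np.floor(n_i * (1 - test_ratio)))  # n_i^train = floor(n_i * (1 - test_ratio))
    n_test = n_i - n_train                           # n_i^test
    print(f"class {cls}: train={n_train}, test={n_test}")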
Statistical Properties
Compared to simple random sampling, stratified sampling provides lower variance in performance estimates and guaranteed proportional representation of every class in both splits.
When Stratification is Critical
Stratification becomes mathematically necessary when: - Class imbalance ratio > 10:1 (high probability of missing rare classes) - Small dataset size (n < 1000, where sampling variation is high) - Multi-class problems with > 5 classes (combinatorial explosion of possible imbalances)
For imbalanced datasets, stratified sampling ensures each split contains approximately the same percentage of samples from each class, maintaining the statistical validity of our evaluation.
def stratified_split_analysis(X, y, test_size=0.2):
"""Compare regular vs stratified splitting"""
# Regular split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
X, y, test_size=test_size, random_state=42
)
# Stratified split
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
X, y, test_size=test_size, random_state=42, stratify=y
)
# Compare distributions
print("📊 Split Comparison:")
print("\nRegular Split:")
print(f" Train distribution: {np.bincount(y_train_reg) / len(y_train_reg)}")
print(f" Test distribution: {np.bincount(y_test_reg) / len(y_test_reg)}")
print("\nStratified Split:")
print(f" Train distribution: {np.bincount(y_train_strat) / len(y_train_strat)}")
print(f" Test distribution: {np.bincount(y_test_strat) / len(y_test_strat)}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Regular split
axes[0,0].bar(range(len(np.bincount(y_train_reg))), np.bincount(y_train_reg))
axes[0,0].set_title('Regular Split - Training')
axes[0,1].bar(range(len(np.bincount(y_test_reg))), np.bincount(y_test_reg))
axes[0,1].set_title('Regular Split - Test')
# Stratified split
axes[1,0].bar(range(len(np.bincount(y_train_strat))), np.bincount(y_train_strat))
axes[1,0].set_title('Stratified Split - Training')
axes[1,1].bar(range(len(np.bincount(y_test_strat))), np.bincount(y_test_strat))
axes[1,1].set_title('Stratified Split - Test')
plt.tight_layout()
plt.show()
return (X_train_strat, X_test_strat, y_train_strat, y_test_strat)
# Create imbalanced dataset example
from sklearn.datasets import make_classification
X_imb, y_imb = make_classification(
n_samples=1000, n_features=20, n_redundant=10,
n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42
)
print(f"Original class distribution: {np.bincount(y_imb)}")
X_train_final, X_test_final, y_train_final, y_test_final = stratified_split_analysis(X_imb, y_imb)
Cross-validation represents one of the most important innovations in statistical learning, providing a principled approach to the bias-variance trade-off in performance estimation.
Theoretical Foundation
Cross-validation addresses the fundamental problem that using the same data for both training and evaluation leads to optimistically biased performance estimates. The CV estimator:
CV_k = (1/k) Σ_{i=1}^k L(f^{-i}, D_i)
Where f^{-i} is trained on all data except fold i, and D_i is the i-th fold, provides a nearly unbiased estimate of generalization error.
Statistical Properties of Different CV Strategies
K-Fold Cross-Validation - Bias: E[CV_k] ≈ E[R(f)] when k is large - Variance: Var[CV_k] increases with k due to overlapping training sets - Optimal k: Often k=5 or k=10 balances bias and variance
Leave-One-Out (LOO) Cross-Validation - Bias: Minimal (uses n-1 samples for training) - Variance: High due to maximum overlap between training sets - Computational: O(n) times more expensive than single split - Special Property: For certain algorithms (like KNN), LOO has analytical solutions
Mathematical Analysis of CV Bias and Variance
The bias-variance decomposition of k-fold CV shows:
MSE[CV_k] = Bias²[CV_k] + Variance[CV_k]
Where: - Bias²[CV_k] ∝ (1/k) × (training_set_size_penalty) - Variance[CV_k] ∝ (k-1)/k × (correlation_between_folds)
This explains why k=5 often provides the best bias-variance trade-off.
Nested Cross-Validation for Hyperparameter Selection
When hyperparameter tuning is involved, we need nested CV to avoid selection bias:
Outer CV: Estimates generalization performance Inner CV: Selects optimal hyperparameters
Without this nesting, performance estimates are optimistically biased because hyperparameters are chosen to minimize CV error on the test folds.
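A compact way to implement nested CV in scikit-learn is to pass a GridSearchCV object (the inner loop) to cross_val_score (the outer loop). The estimator, parameter grid, and dataset below are illustrative choices, not prescribed by the text:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X_nested, y_nested = load_iris(return_X_y=True)
# Inner loop: GridSearchCV picks C on each outer-training fold
inner_cv = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)
# Outer loop: cross_val_score estimates generalization of the whole procedure
outer_scores = cross_val_score(inner_cv, X_nested, y_nested, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")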
Cross-validation provides a more robust evaluation by using multiple train-test splits while maintaining rigorous statistical properties.
from sklearn.model_selection import (
cross_val_score, KFold, StratifiedKFold,
LeaveOneOut, ShuffleSplit
)
from sklearn.ensemble import RandomForestClassifier
def compare_cv_strategies(X, y):
"""Compare different cross-validation strategies"""
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv_strategies = {
'K-Fold (5)': KFold(n_splits=5, shuffle=True, random_state=42),
'Stratified K-Fold (5)': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
'Leave-One-Out': LeaveOneOut(),
'Shuffle Split (10)': ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
}
results = {}
print("🔄 Cross-Validation Comparison:")
print("=" * 50)
for name, cv in cv_strategies.items():
if name == 'Leave-One-Out' and len(X) > 100:
print(f"{name}: Skipped (too many samples)")
continue
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
results[name] = scores
print(f"{name}:")
print(f" Scores: {scores.round(3)}")
print(f" Mean: {scores.mean():.3f} ± {scores.std():.3f}")
print()
# Visualize results
plt.figure(figsize=(12, 6))
# Box plot of CV scores
plt.subplot(1, 2, 1)
data_to_plot = [scores for scores in results.values()]
labels = list(results.keys())
plt.boxplot(data_to_plot, labels=labels)
plt.xticks(rotation=45)
plt.ylabel('Accuracy Score')
plt.title('Cross-Validation Score Distribution')
# Mean and std comparison
plt.subplot(1, 2, 2)
means = [scores.mean() for scores in results.values()]
stds = [scores.std() for scores in results.values()]
x_pos = np.arange(len(labels))
plt.bar(x_pos, means, yerr=stds, capsize=5, alpha=0.7)
plt.xticks(x_pos, labels, rotation=45)
plt.ylabel('Mean Accuracy ± Std')
plt.title('Cross-Validation Performance')
plt.tight_layout()
plt.show()
return results
# Test with iris dataset
cv_results = compare_cv_strategies(X, y)
For time series data, we need to respect temporal order when splitting.
from sklearn.model_selection import TimeSeriesSplit
def time_series_split_demo():
"""Demonstrate time series splitting"""
# Create sample time series data
dates = pd.date_range('2020-01-01', periods=100, freq='D')
values = np.cumsum(np.random.randn(100)) + 100 # Random walk
ts_data = pd.DataFrame({
'date': dates,
'value': values,
'feature1': np.random.randn(100),
'feature2': np.random.randn(100)
})
# Time series split
tscv = TimeSeriesSplit(n_splits=5)
plt.figure(figsize=(15, 8))
for i, (train_idx, test_idx) in enumerate(tscv.split(ts_data)):
plt.subplot(2, 3, i+1)
# Plot training data
plt.plot(ts_data.iloc[train_idx]['date'], ts_data.iloc[train_idx]['value'],
'b-', label='Training', alpha=0.7)
# Plot test data
plt.plot(ts_data.iloc[test_idx]['date'], ts_data.iloc[test_idx]['value'],
'r-', label='Test', alpha=0.7)
plt.title(f'Split {i+1}')
plt.xticks(rotation=45)
if i == 0:
plt.legend()
plt.subplot(2, 3, 6)
plt.text(0.1, 0.5,
"Time Series Cross-Validation:\n\n"
"✅ Respects temporal order\n"
"✅ Training always before test\n"
"✅ No data leakage from future\n"
"✅ Each split uses more training data",
transform=plt.gca().transAxes,
fontsize=12,
verticalalignment='center')
plt.axis('off')
plt.tight_layout()
plt.show()
return ts_data
ts_example = time_series_split_demo()
After preprocessing, it's crucial to validate that your data meets the requirements for machine learning.
def validate_processed_data(X_train, X_test, y_train, y_test):
"""Comprehensive data validation"""
print("✅ DATA VALIDATION CHECKLIST")
print("=" * 40)
# 1. Shape consistency
print(f"1. Shape Consistency:")
print(f" X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f" X_test: {X_test.shape}, y_test: {y_test.shape}")
feature_match = X_train.shape[1] == X_test.shape[1]
sample_match = X_train.shape[0] == len(y_train) and X_test.shape[0] == len(y_test)
print(f" ✅ Features match: {feature_match}")
print(f" ✅ Samples match: {sample_match}")
# 2. Missing values check
train_missing = np.isnan(X_train).sum() if isinstance(X_train, np.ndarray) else X_train.isnull().sum().sum()
test_missing = np.isnan(X_test).sum() if isinstance(X_test, np.ndarray) else X_test.isnull().sum().sum()
print(f"\n2. Missing Values:")
print(f" Training set: {train_missing}")
print(f" Test set: {test_missing}")
print(f" ✅ No missing values: {train_missing == 0 and test_missing == 0}")
# 3. Data type consistency
if hasattr(X_train, 'dtypes'):
train_types = set(X_train.dtypes)
test_types = set(X_test.dtypes)
type_match = train_types == test_types
print(f"\n3. Data Types:")
print(f" Types consistent: ✅ {type_match}")
# 4. Value ranges
if isinstance(X_train, np.ndarray):
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)
else:
X_train_df, X_test_df = X_train, X_test
print(f"\n4. Value Ranges:")
for col in X_train_df.columns:
train_range = (X_train_df[col].min(), X_train_df[col].max())
test_range = (X_test_df[col].min(), X_test_df[col].max())
# Check if test range is within training range (approximately)
reasonable_range = (test_range[0] >= train_range[0] * 0.8 and
test_range[1] <= train_range[1] * 1.2)
print(f" Column {col}: Train{train_range}, Test{test_range} ✅ {reasonable_range}")
# 5. Class distribution (for classification)
if len(np.unique(y_train)) < 20: # Assume classification if few unique values
print(f"\n5. Class Distribution:")
train_dist = np.bincount(y_train) / len(y_train)
test_dist = np.bincount(y_test) / len(y_test)
print(f" Training: {train_dist.round(3)}")
print(f" Test: {test_dist.round(3)}")
# Check if distributions are similar (within 10% difference)
dist_similar = np.allclose(train_dist, test_dist, atol=0.1)
print(f" ✅ Similar distributions: {dist_similar}")
return {
'shape_consistent': feature_match and sample_match,
'no_missing': train_missing == 0 and test_missing == 0,
'ready_for_ml': feature_match and sample_match and train_missing == 0 and test_missing == 0
}
# Example validation
validation_results = validate_processed_data(X_train_final, X_test_final,
y_train_final, y_test_final)
print(f"\n🎯 Overall Assessment: {'Ready for ML!' if validation_results['ready_for_ml'] else 'Needs more work'}")Many machine learning algorithms are sensitive to the scale of input features, making normalization and standardization critical preprocessing steps. Understanding the mathematical rationale helps us choose the appropriate scaling method.
The Scale Sensitivity Problem
Consider a dataset with features of vastly different scales: - Age: [20, 65] (range ≈ 45) - Income: [30000, 200000] (range ≈ 170000) - Years of education: [12, 20] (range ≈ 8)
Distance-based algorithms (KNN, SVM, clustering) will be dominated by the income feature simply due to its large numerical range, not its predictive importance. This violates the assumption that all features should contribute equally to similarity calculations.
Mathematical Justification for Scaling
For algorithms that use Euclidean distance, the distance between two points is:
d(x₁, x₂) = √(Σᵢ (x₁ᵢ - x₂ᵢ)²)
Without scaling, features with larger ranges contribute disproportionately to this distance calculation. This creates a mathematical bias toward high-variance features.
Standardization (Z-score Normalization)
Standardization transforms features to have zero mean and unit variance:
z = (x - μ) / σ
Where: - μ = sample mean = (1/n) Σᵢ xᵢ - σ = sample standard deviation = √[(1/n) Σᵢ (xᵢ - μ)²]
Mathematical Properties: - E[z] = 0 (zero mean) - Var[z] = 1 (unit variance) - Preserves shape of original distribution - Robust to outliers: No (outliers affect μ and σ)
Min-Max Normalization
Min-max scaling transforms features to a fixed range, typically [0,1]:
x_norm = (x - min(x)) / (max(x) - min(x))
Mathematical Properties: - Range: [0, 1] (or any specified [a, b]) - Preserves relationships: Linear transformation maintains relative distances - Outlier sensitivity: High (min and max are affected by outliers)
Robust Scaling
Uses median and interquartile range instead of mean and standard deviation:
x_robust = (x - median(x)) / IQR(x)
Where IQR = Q₃ - Q₁ (75th percentile - 25th percentile)
Mathematical Properties: - Robust to outliers: Uses median and IQR - No guarantee of fixed range: Unlike min-max scaling - Preserves distribution shape: Better than standard scaling for skewed data
When to Use Each Method
In brief: use standardization for algorithms that assume centered, comparable-variance inputs (SVM, neural networks, PCA); use min-max scaling when you know the approximate bounds of your data or need a fixed range; use robust scaling when outliers are present. The sections that follow demonstrate each method in detail.
Creating reusable preprocessing pipelines ensures consistency and makes your workflow more maintainable.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
class DataPreprocessor:
"""Complete data preprocessing pipeline"""
def __init__(self, numeric_strategy='mean', categorical_strategy='most_frequent'):
self.numeric_strategy = numeric_strategy
self.categorical_strategy = categorical_strategy
self.preprocessor = None
self.label_encoder = LabelEncoder()
def fit(self, X, y=None):
"""Fit the preprocessing pipeline"""
# Identify column types
numeric_features = X.select_dtypes(include=[np.number]).columns
categorical_features = X.select_dtypes(include=['object']).columns
# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy=self.numeric_strategy)),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy=self.categorical_strategy)),
('encoder', OneHotEncoder(handle_unknown='ignore')) # LabelEncoder is for targets; OneHotEncoder works inside feature pipelines
])
# Combine preprocessors
self.preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
self.preprocessor.fit(X)
if y is not None:
self.label_encoder.fit(y)
return self
def transform(self, X, y=None):
"""Transform the data"""
X_transformed = self.preprocessor.transform(X)
if y is not None:
y_transformed = self.label_encoder.transform(y)
return X_transformed, y_transformed
return X_transformed
def fit_transform(self, X, y=None):
"""Fit and transform in one step"""
return self.fit(X, y).transform(X, y)
def get_feature_names(self):
"""Get feature names after transformation"""
return self.preprocessor.get_feature_names_out()
# Example usage
sample_messy_data = pd.DataFrame({
'age': [25, None, 35, 40, 150], # Has missing and outlier
'income': [50000, 60000, None, 80000, 55000], # Has missing
'education': ['Bachelor', 'Master', None, 'PhD', 'Bachelor'], # Has missing
'target': ['A', 'B', 'A', 'C', 'B']
})
# Separate features and target
X = sample_messy_data.drop('target', axis=1)
y = sample_messy_data['target']
# Apply preprocessing
preprocessor = DataPreprocessor()
X_processed, y_processed = preprocessor.fit_transform(X, y)
print("📊 Preprocessing Results:")
print(f"Original shape: {X.shape}")
print(f"Processed shape: {X_processed.shape}")
print(f"Feature names: {preprocessor.get_feature_names()}")
print(f"Processed data sample:\n{X_processed[:3]}")1. The Principle of Statistical Independence
All preprocessing decisions must maintain the independence between training and test data. Mathematically, this means:
P(preprocess | test_data) = P(preprocess | training_data_only)
Any violation of this principle leads to data leakage, where information from the test set influences the preprocessing, creating optimistically biased performance estimates.
2. Distribution Preservation Principle
Preprocessing should preserve the essential statistical properties of the data generating process. For any preprocessing function f():
P(y | f(X)) should approximate P(y | X)
This ensures that learned relationships remain valid after transformation.
3. Stationarity Assumption
Preprocessing parameters (means, variances, encodings) should be stable across train/test splits:
θ_preprocessing^train ≈ θ_preprocessing^test
Significant differences suggest non-stationary data or inadequate sampling.
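One simple stability check is to compare the statistics a preprocessor would learn (here, per-feature means) across several random splits. A hedged sketch, using the Iris data purely for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_stab, _ = load_iris(return_X_y=True)
for seed in range(3):
    X_tr, X_te = train_test_split(X_stab, test_size=0.2, random_state=seed)
    # Drift between the parameters a scaler would learn on each side of the split
    drift = np.abs(X_tr.mean(axis=0) - X_te.mean(axis=0))
    print(f"split {seed}: max |mean_train - mean_test| = {drift.max():.3f}")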
1. Exploratory Data Analysis First - Understand distributional assumptions before choosing preprocessing methods - Test for stationarity, normality, and independence - Identify the data generating mechanism to inform preprocessing choices
2. Missing Data Mechanism Analysis - Statistically test for MCAR using Little’s test - Choose imputation methods based on missing data theory - Validate that imputation preserves important relationships
3. Preprocessing Parameter Estimation - Fit all preprocessing parameters only on training data - Use cross-validation to validate preprocessing choices - Monitor parameter stability across different train/test splits
4. Pipeline Validation and Monitoring - Implement statistical tests for preprocessing invariants - Monitor distribution drift in production - Version preprocessing transformations with data
1. Data Leakage: A Statistical Violation
Data leakage occurs when preprocessing uses information that wouldn’t be available in practice. Mathematically, this creates:
P̂(performance | train + test info) > P(performance | train info only)
The performance estimate becomes optimistically biased because the model indirectly accesses test set information through preprocessing parameters.
Example: Fitting a scaler on the entire dataset - Wrong: scaler.fit(X_entire) - Mathematical Problem: Test set statistics influence training set normalization - Result: Performance estimates are inflated by ~5-15%
2. Preprocessing Inconsistency: Distribution Shift
When train and test preprocessing differs, we create artificial distribution shift:
P(X_train_processed) ≠ P(X_test_processed)
This violates the fundamental assumption that train and test data come from the same distribution.
3. Over-preprocessing: Information Loss
Excessive preprocessing can remove signal along with noise. The trade-off is:
Total Error = Bias² + Variance + Irreducible Error
Over-preprocessing can reduce variance (noise) but increase bias (signal loss), potentially worsening overall performance.
4. Statistical Type Errors
Treating data types incorrectly creates mathematical inconsistencies:
Each violation changes the mathematical structure that algorithms assume, leading to suboptimal learning.
# Example of data leakage (WRONG way)
def wrong_preprocessing(X, y):
"""Example of what NOT to do"""
# WRONG: Fitting imputer on entire dataset
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X) # Should only fit on training data
# WRONG: Scaling entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed) # Should only fit on training data
# Then splitting
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
return X_train, X_test, y_train, y_test
# Correct way
def correct_preprocessing(X, y):
"""Example of correct preprocessing"""
# CORRECT: Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# CORRECT: Fit on training data only
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test) # Only transform, not fit
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed) # Only transform, not fit
return X_train_scaled, X_test_scaled, y_train, y_test
print("✅ Always remember: Fit on training data, transform on both!")Mathematical Necessity: Data preprocessing isn’t just practical housekeeping✅it’s mathematically required to satisfy algorithmic assumptions and optimize the learning objective function.
Statistical Missing Data Theory: Understanding Rubin’s missing data mechanisms (MCAR, MAR, MNAR) is crucial for choosing valid imputation strategies and maintaining statistical inference properties.
Generalization Theory: Proper dataset splitting and cross-validation are grounded in statistical learning theory, specifically the bias-variance decomposition of generalization error.
Scale Invariance: Feature scaling addresses mathematical biases in distance-based algorithms, ensuring that feature importance is determined by predictive value, not numerical scale.
Statistical Independence: All preprocessing must maintain independence between training and test data to preserve the validity of performance estimates.
Information Preservation: The goal is to reduce noise while preserving signal, balancing the bias-variance trade-off inherent in any data transformation.
In Chapter 3: Feature Engineering, you’ll learn how to: - Create new features from existing data - Select the most informative features - Apply dimensionality reduction techniques - Engineer domain-specific features - Evaluate feature importance
Data preprocessing sets the foundation, but feature engineering is where you can truly unlock the potential hidden in your data!
Create a comprehensive data quality report for a messy dataset: 1. Load a real-world dataset (from Kaggle or UCI repository) 2. Identify all data quality issues 3. Create visualizations showing the problems 4. Propose solutions for each issue
Compare different missing data handling strategies: 1. Create a dataset with different types of missing data (MCAR, MAR, MNAR) 2. Apply various imputation methods 3. Evaluate the impact on model performance 4. Determine which strategy works best for each scenario
Build a complete preprocessing pipeline: 1. Handle mixed data types (numeric, categorical, dates) 2. Include outlier detection and handling 3. Implement proper train-test splitting 4. Add data validation checks 5. Make it reusable for new datasets
Data preprocessing might not be glamorous, but it's the foundation upon which all successful machine learning projects are built. Master these skills, and you'll be well-equipped to handle real-world data challenges!
# Chapter 3: Feature Engineering
“Feature engineering is often the difference between a good model and a great model.”
- Anonymous Data Scientist
By the end of this chapter, you’ll master: - Feature scaling and normalization techniques - Various feature selection methods - Feature extraction using PCA and LDA - Advanced feature engineering strategies - Feature importance evaluation and interpretation
Feature engineering lies at the heart of statistical learning theory, representing the crucial transformation from raw observations to informative representations that enable effective pattern recognition. From an information-theoretic perspective, the goal is to maximize the mutual information between features and targets while minimizing redundancy.
Information-Theoretic Foundation
Consider the fundamental relationship between features X and target Y. The mutual information I(X;Y) quantifies how much knowing X reduces uncertainty about Y:
I(X;Y) = H(Y) - H(Y|X)
Where H(Y) is the entropy of Y and H(Y|X) is the conditional entropy. Feature engineering aims to find transformations f(X) that maximize I(f(X);Y).
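In practice, I(X;Y) can be estimated per feature with scikit-learn's mutual_info_classif (a nearest-neighbor based estimator); the dataset below is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
# Estimated mutual information (in nats) between each feature and the class label
mi = mutual_info_classif(iris.data, iris.target, random_state=42)
for name, score in zip(iris.feature_names, mi):
    print(f"I({name}; target) ≈ {score:.3f} nats")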
The Feature Representation Problem
In statistical learning, we assume data is generated by some unknown process P(X,Y). The challenge is that raw features X_raw may not provide the optimal representation for learning this relationship. Feature engineering seeks transformations:
X_engineered = φ(X_raw)
Such that a learning algorithm can more easily approximate the true function:
**f*: X_engineered → Y**
This is analogous to basis functions in functional analysis: we're finding a representation space where the target function has desirable properties (linearity, smoothness, separability).
Think of features as coordinates in a multi-dimensional space where our learning algorithm searches for patterns. Just as choosing the right coordinate system can make calculus problems trivial or impossible, choosing the right features can make prediction problems learnable or intractable.
Feature engineering fundamentally changes the learning problem’s statistical properties. The choice of features affects three critical aspects of model performance: bias, variance, and computational complexity.
The Bias-Variance-Complexity Trade-off
Poor features increase model bias by making the true relationship harder to approximate. Rich features can reduce bias but increase variance by providing more ways to overfit. The optimal feature set minimizes:
Total Error = Bias² + Variance + Irreducible Error + Computational Cost
Dimensionality and Sample Complexity
The curse of dimensionality shows that sample complexity grows exponentially with irrelevant dimensions. If we have d features, we typically need O(2^d) samples to maintain the same generalization performance. Feature engineering helps by removing irrelevant dimensions and concentrating predictive information in fewer features, reducing the number of samples needed to generalize.
Consider two scenarios with different feature representations:
Scenario 1: Raw Features
# Raw customer data
customer_data = {
'purchase_date': '2023-05-15',
'birth_date': '1990-03-22',
'purchase_amount': 150.75,
'last_purchase': '2023-04-10'
}
Scenario 2: Engineered Features
# Engineered features from the same data
engineered_features = {
'customer_age': 33,
'days_since_last_purchase': 35,
'purchase_amount_log': 5.015,
'is_weekend_purchase': True,
'purchase_frequency': 2.4 # purchases per month
}
The engineered features provide much more predictive power because they capture relationships and patterns that raw data doesn't reveal directly.
Feature scaling addresses fundamental mathematical issues in optimization and distance computation. Many algorithms implicitly assume that all features contribute equally to similarity measures or gradient computations.
Distance-Based Algorithm Bias
Consider the Euclidean distance between two points in feature space:
d(x₁, x₂) = √(Σᵢ (x₁ᵢ - x₂ᵢ)²)
Without scaling, features with larger numerical ranges dominate this calculation. For a dataset with feature scales of [1, 1000], the second feature's squared differences contribute up to 10⁶ times more to the distance calculation than the first, regardless of predictive importance.
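A two-point numeric example makes the bias concrete (the feature values are invented):

import numpy as np

a = np.array([25, 40000])  # [age, income] for person A (hypothetical)
b = np.array([55, 41000])  # [age, income] for person B (hypothetical)
diff_sq = (a - b) ** 2
print(diff_sq)  # [900, 1000000] -> the income term dominates
print(f"Euclidean distance = {np.sqrt(diff_sq.sum()):.1f}")  # ≈ 1000.4, driven almost entirely by income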
Gradient-Based Optimization Issues
In gradient descent, the update rule is:
**θᵢ ← θᵢ - α ∂J(θ)/∂θᵢ**
When features have different scales, the partial derivatives ∂J/∂θᵢ have very different magnitudes: parameters attached to large-scale features receive large, unstable updates, while parameters attached to small-scale features learn slowly. The result is an ill-conditioned optimization problem.
Condition Number and Convergence
The condition number κ of a matrix measures how difficult the optimization problem is. For unscaled features:
κ = λ_max / λ_min
Where λ are eigenvalues of the Hessian matrix. Poor scaling leads to high condition numbers, requiring more iterations for convergence and increasing numerical instability.
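The effect of scaling on conditioning is easy to verify numerically. This sketch (the synthetic feature scales are chosen purely for illustration) compares the condition number of XᵀX before and after standardization:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_cond = np.column_stack([
    rng.normal(0, 1, 500),     # small-scale feature
    rng.normal(0, 1000, 500)   # large-scale feature
])
X_cond_scaled = StandardScaler().fit_transform(X_cond)
print(f"cond(X^T X) unscaled: {np.linalg.cond(X_cond.T @ X_cond):.2e}")        # roughly 1e6
print(f"cond(X^T X) scaled:   {np.linalg.cond(X_cond_scaled.T @ X_cond_scaled):.2e}")  # close to 1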
Machine learning algorithms often struggle when features have vastly different scales. Consider this dataset:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Example: House features with different scales
house_data = pd.DataFrame({
'price': [250000, 300000, 180000, 450000, 320000],
'sqft': [1200, 1500, 900, 2200, 1600],
'bedrooms': [2, 3, 2, 4, 3],
'age_years': [10, 5, 25, 2, 8]
})
print("Original data ranges:")
print(house_data.describe())
The price ranges from 180,000 to 450,000, while bedrooms only range from 2 to 4. This scale difference can cause problems for algorithms like KNN or neural networks.
Standardization transforms features to have mean = 0 and standard deviation = 1.
Formula: z = (x - μ) / σ
def demonstrate_standardization():
"""Show standardization in action"""
# Apply standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(house_data)
# Convert back to DataFrame for readability
standardized_df = pd.DataFrame(
standardized_data,
columns=house_data.columns
)
print("After standardization:")
print(standardized_df.round(3))
print(f"\nMeans: {standardized_df.mean().round(3)}")
print(f"Standard deviations: {standardized_df.std().round(3)}")
return standardized_df
# Example output shows all features centered around 0
When to use: Most algorithms (SVM, Neural Networks, PCA) when you want to preserve the shape of the distribution.
Scales features to a fixed range, typically [0, 1].
Formula:
x_scaled = (x - x_min) / (x_max - x_min)
def demonstrate_minmax_scaling():
"""Show Min-Max scaling in action"""
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(house_data)
scaled_df = pd.DataFrame(scaled_data, columns=house_data.columns)
print("After Min-Max scaling:")
print(scaled_df.round(3))
print(f"\nMinimums: {scaled_df.min()}")
print(f"Maximums: {scaled_df.max()}")
return scaled_df
# All features now range from 0 to 1
When to use: When you know the approximate upper and lower bounds of your data, or when you need a specific range.
Uses median and interquartile range, less sensitive to outliers.
Formula:
x_scaled = (x - median) / IQR
def demonstrate_robust_scaling():
"""Show Robust scaling with outliers"""
# Add outlier to demonstrate robustness
data_with_outlier = house_data.copy()
data_with_outlier.loc[5] = [1000000, 1400, 3, 15] # Expensive outlier
print("Data with outlier:")
print(data_with_outlier)
# Compare Standard vs Robust scaling
standard_scaler = StandardScaler()
robust_scaler = RobustScaler()
standard_scaled = standard_scaler.fit_transform(data_with_outlier)
robust_scaled = robust_scaler.fit_transform(data_with_outlier)
print("\nStandard scaling (affected by outlier):")
print(pd.DataFrame(standard_scaled, columns=house_data.columns).round(3))
print("\nRobust scaling (less affected by outlier):")
print(pd.DataFrame(robust_scaled, columns=house_data.columns).round(3))
# Robust scaling handles outliers better
When to use: When your data contains outliers that you want to preserve but not let dominate the scaling.
import matplotlib.pyplot as plt
def visualize_scaling_methods():
"""Compare different scaling methods visually"""
# Generate sample data with different distributions
np.random.seed(42)
data = pd.DataFrame({
'normal': np.random.normal(100, 15, 1000),
'exponential': np.random.exponential(2, 1000),
'uniform': np.random.uniform(0, 50, 1000)
})
# Apply different scaling methods
scalers = {
'Original': None,
'StandardScaler': StandardScaler(),
'MinMaxScaler': MinMaxScaler(),
'RobustScaler': RobustScaler()
}
fig, axes = plt.subplots(4, 3, figsize=(15, 12))
for i, (scaler_name, scaler) in enumerate(scalers.items()):
if scaler is None:
scaled_data = data
else:
scaled_data = pd.DataFrame(
scaler.fit_transform(data),
columns=data.columns
)
for j, column in enumerate(data.columns):
axes[i, j].hist(scaled_data[column], bins=50, alpha=0.7)
axes[i, j].set_title(f'{scaler_name} - {column}')
axes[i, j].set_xlabel('Value')
axes[i, j].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# This shows how each method affects different distributions
Feature selection addresses the fundamental challenge of identifying which variables carry genuine predictive signal versus those that contribute only noise or redundancy. This process is grounded in information theory, statistical inference, and computational complexity theory.
Information-Theoretic Perspective
From an information theory standpoint, we seek features that maximize mutual information with the target while minimizing redundancy among themselves:
Objective: max I(X_selected; Y) - λ × Redundancy(X_selected)
Where: - I(X_selected; Y) measures predictive information - Redundancy(X_selected) penalizes correlated features - λ controls the trade-off between relevance and redundancy
The Curse of Dimensionality: Mathematical Analysis
High-dimensional spaces exhibit counterintuitive properties that hurt learning: data becomes exponentially sparse, pairwise distances concentrate (nearest and farthest neighbors become nearly equidistant), and the sample size needed to cover the space grows exponentially.
Statistical Learning Theory of Feature Selection
The generalization bound for a model with d features is approximately:
R(h) ≤ R̂(h) + O(√(d log(n)/n))
Where R(h) is true risk, R̂(h) is empirical risk, and n is sample size. This shows that excess features directly worsen generalization bounds.
Three Categories of Feature Selection: filter methods (univariate statistical tests, model-agnostic), wrapper methods (subset search guided by model performance), and embedded methods (selection built into model training).
Each approach represents different trade-offs between computational cost and optimality.
def demonstrate_curse_of_dimensionality():
"""Show how irrelevant features hurt performance"""
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Create dataset with increasing numbers of irrelevant features
n_relevant = 5
irrelevant_features = [0, 10, 50, 100, 500]
results = []
for n_irrelevant in irrelevant_features:
# Generate data
X, y = make_classification(
n_samples=1000,
n_features=n_relevant + n_irrelevant,
n_informative=n_relevant,
n_redundant=0,
n_clusters_per_class=1,
random_state=42
)
# Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
results.append((n_relevant + n_irrelevant, accuracy))
print(f"Total features: {n_relevant + n_irrelevant:3d}, Accuracy: {accuracy:.3f}")
return results
# Shows how adding irrelevant features decreases performance
Filter methods apply statistical hypothesis tests to measure the strength of association between features and targets. These methods are computationally efficient because they evaluate each feature independently of the learning algorithm.
Mathematical Foundation
Filter methods test the null hypothesis: H₀: Feature X_i is independent of target Y
The p-value p_i measures the probability of observing the data under H₀. Features with p_i < α (significance threshold) are considered relevant.
Common Statistical Tests for Filter Methods: Pearson correlation (numeric feature, numeric target), the chi-square test (categorical feature, categorical target), the ANOVA F-test (numeric feature, categorical target), and mutual information (any type).
Filter methods provide fast, model-agnostic feature relevance assessment based on univariate statistical relationships.
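As a concrete example of the hypothesis-testing view, scikit-learn's f_classif returns per-feature ANOVA F statistics and p-values for the null hypothesis that class means are equal; the dataset and significance threshold below are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X_f, y_f = load_iris(return_X_y=True)
F, p = f_classif(X_f, y_f)  # per-feature F statistic and p-value
for i, (f_stat, p_val) in enumerate(zip(F, p)):
    verdict = "keep" if p_val < 0.05 else "drop"  # alpha = 0.05 (illustrative)
    print(f"Feature {i}: F={f_stat:.1f}, p={p_val:.2e} -> {verdict}")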
def correlation_feature_selection(data, target, threshold=0.7):
"""Select features based on correlation with target and among themselves"""
# Calculate correlation with target
target_corr = data.corrwith(target).abs().sort_values(ascending=False)
print("Correlation with target:")
print(target_corr)
# Remove highly correlated features (multicollinearity)
corr_matrix = data.corr().abs()
# Find pairs of highly correlated features
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
for j in range(i):
if corr_matrix.iloc[i, j] > threshold:
colname_i = corr_matrix.columns[i]
colname_j = corr_matrix.columns[j]
high_corr_pairs.append((colname_i, colname_j, corr_matrix.iloc[i, j]))
print(f"\nHighly correlated pairs (>{threshold}):")
for pair in high_corr_pairs:
print(f"{pair[0]} - {pair[1]}: {pair[2]:.3f}")
return target_corr, high_corr_pairs
# Example usage with house data
# target_corr, high_corr = correlation_feature_selection(house_data, target_prices)
from sklearn.feature_selection import chi2, SelectKBest
def chi_square_selection(X_categorical, y, k=5):
"""Select categorical features using Chi-square test"""
# Apply Chi-square test
chi2_selector = SelectKBest(chi2, k=k)
X_selected = chi2_selector.fit_transform(X_categorical, y)
# Get feature scores
feature_scores = chi2_selector.scores_
selected_features = chi2_selector.get_support(indices=True)
print("Chi-square scores:")
for i, score in enumerate(feature_scores):
status = "✅" if i in selected_features else "✅"
print(f"Feature {i}: {score:.3f} {status}")
return X_selected, selected_features
# Example with categorical data
def create_categorical_example():
"""Create example categorical data"""
np.random.seed(42)
data = pd.DataFrame({
'color': np.random.choice(['red', 'blue', 'green'], 1000),
'size': np.random.choice(['small', 'medium', 'large'], 1000),
'material': np.random.choice(['wood', 'metal', 'plastic'], 1000),
'brand': np.random.choice(['A', 'B', 'C', 'D'], 1000)
})
# Create target that depends on some features
target = (
(data['color'] == 'red').astype(int) +
(data['size'] == 'large').astype(int) +
np.random.randint(0, 2, 1000) # Add noise
) > 1
return data, target
Wrapper methods treat feature selection as a discrete optimization problem, searching through the exponential space of feature subsets to find the combination that optimizes model performance.
Mathematical Formulation
The wrapper method optimization problem is:
S* = arg max_{S⊆F} CV_score(Algorithm(X_S, y))
Where: - S is a feature subset from the full feature set F - X_S contains only features in subset S - CV_score is cross-validation performance - The search space has 2^|F| possible subsets
Computational Complexity
Exhaustive search has exponential complexity O(2^d), making it intractable for large feature sets. Practical wrapper methods use heuristic search algorithms:
Search Strategy Analysis
Forward Selection Algorithm: 1. Start with the empty feature set: S = ∅ 2. At each step, add the feature f that maximizes CV_score(S ∪ {f}) 3. Stop when performance plateaus or the maximum number of features is reached
Optimality Properties: - Not globally optimal: Greedy choices may miss better combinations - Monotone submodular approximation: Under certain conditions, achieves (1-1/e) of optimal - Local search guarantee: Finds local optimum in polynomial time
Wrapper methods are computationally expensive but model-specific, often yielding better performance than filter methods by accounting for feature interactions.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
def forward_selection_demo(X, y, max_features=5):
"""Demonstrate forward feature selection"""
# Create base estimator
estimator = LogisticRegression(random_state=42, max_iter=1000)
# Forward selection
forward_selector = SequentialFeatureSelector(
estimator,
n_features_to_select=max_features,
direction='forward',
cv=5,
scoring='accuracy'
)
# Fit and transform
X_selected = forward_selector.fit_transform(X, y)
selected_features = forward_selector.get_support(indices=True)
print("Forward Selection Results:")
print(f"Selected {len(selected_features)} features: {selected_features}")
print(f"Original shape: {X.shape}, Selected shape: {X_selected.shape}")
# Show selection process
print("\nFeature selection scores:")
for i, selected in enumerate(forward_selector.get_support()):
status = "✅ Selected" if selected else "✅ Not selected"
print(f"Feature {i}: {status}")
return X_selected, selected_features
# Manual implementation for educational purposes
def manual_forward_selection(X, y, max_features=5):
"""Manual implementation to understand the process"""
from sklearn.model_selection import cross_val_score
n_features = X.shape[1]
selected_features = []
remaining_features = list(range(n_features))
print("Forward Selection Process:")
print("=" * 50)
for step in range(max_features):
best_score = 0
best_feature = None
# Try adding each remaining feature
for feature in remaining_features:
current_features = selected_features + [feature]
X_subset = X.iloc[:, current_features]
# Cross-validate
estimator = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(estimator, X_subset, y, cv=3)
avg_score = scores.mean()
if avg_score > best_score:
best_score = avg_score
best_feature = feature
# Add best feature
if best_feature is not None:
selected_features.append(best_feature)
remaining_features.remove(best_feature)
print(f"Step {step + 1}: Added feature {best_feature}, Score: {best_score:.4f}")
return selected_features
def backward_elimination_demo(X, y, min_features=3):
"""Demonstrate backward feature elimination"""
estimator = LogisticRegression(random_state=42, max_iter=1000)
# Backward elimination
backward_selector = SequentialFeatureSelector(
estimator,
n_features_to_select=min_features,
direction='backward',
cv=5,
scoring='accuracy'
)
X_selected = backward_selector.fit_transform(X, y)
selected_features = backward_selector.get_support(indices=True)
print("Backward Elimination Results:")
print(f"Kept {len(selected_features)} features: {selected_features}")
print(f"Original shape: {X.shape}, Final shape: {X_selected.shape}")
return X_selected, selected_features
def manual_backward_elimination(X, y, min_features=3):
"""Manual implementation of backward elimination"""
from sklearn.model_selection import cross_val_score
current_features = list(range(X.shape[1]))
print("Backward Elimination Process:")
print("=" * 50)
while len(current_features) > min_features:
best_removal_score = -float('inf')
feature_to_remove = None
# Try removing each feature
for feature in current_features:
temp_features = [f for f in current_features if f != feature]
X_subset = X.iloc[:, temp_features]
estimator = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(estimator, X_subset, y, cv=3)
avg_score = scores.mean()
# Keep the removal that leaves the best score (i.e., the least useful feature)
if avg_score > best_removal_score:
best_removal_score = avg_score
feature_to_remove = feature
# Remove the least useful feature
if feature_to_remove is not None:
current_features.remove(feature_to_remove)
print(f"Removed feature {feature_to_remove}, Remaining score: {best_removal_score:.4f}")
return current_features
from sklearn.feature_selection import RFE, RFECV
def rfe_demonstration(X, y, n_features=5):
"""Demonstrate Recursive Feature Elimination"""
# Basic RFE
estimator = LogisticRegression(random_state=42, max_iter=1000)
rfe = RFE(estimator, n_features_to_select=n_features)
X_rfe = rfe.fit_transform(X, y)
print("RFE Results:")
print(f"Selected {n_features} features")
print("Feature rankings:")
for i, (rank, support) in enumerate(zip(rfe.ranking_, rfe.support_)):
status = "✅ Selected" if support else f"✅ Rank {rank}"
print(f"Feature {i}: {status}")
return X_rfe, rfe.support_
def rfecv_demonstration(X, y):
"""RFE with Cross-Validation to find optimal number of features"""
estimator = LogisticRegression(random_state=42, max_iter=1000)
rfecv = RFECV(estimator, cv=5, scoring='accuracy')
X_rfecv = rfecv.fit_transform(X, y)
print("RFECV Results:")
print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {np.where(rfecv.support_)[0]}")
# Plot validation scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
rfecv.cv_results_['mean_test_score'], 'o-')
plt.xlabel('Number of Features')
plt.ylabel('Cross-Validation Score')
plt.title('RFE with Cross-Validation')
plt.axvline(x=rfecv.n_features_, color='red', linestyle='--',
label=f'Optimal: {rfecv.n_features_} features')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
return X_rfecv, rfecv.support_
Embedded methods integrate feature selection directly into the learning objective through regularization, automatically identifying relevant features during optimization. This approach is grounded in sparsity theory and convex optimization.
Mathematical Foundation: Regularized Optimization
Embedded methods modify the learning objective by adding a regularization term that encourages sparsity:
min_θ L(θ; X, y) + λR(θ)
Where: - L(θ; X, y) is the loss function (e.g., MSE, log-likelihood) - R(θ) is the regularization penalty - λ controls the sparsity-accuracy trade-off
L1 Regularization (Lasso): Mathematical Properties
The L1 penalty R(θ) = ||θ||₁ = Σᵢ|θᵢ| has unique theoretical properties:
Geometric Interpretation: The L1 constraint forms an L1-ball (diamond in 2D, hyperdiamond in higher dimensions) with corners on coordinate axes, encouraging sparse solutions.
Statistical Properties of Lasso
Under certain conditions (restricted eigenvalue condition), Lasso achieves:
**||θ̂ - θ*||₂ ≤ C√(s log(p)/n)**
Where s is the sparsity level and p is the number of features. This bound shows Lasso can handle high-dimensional problems when the true model is sparse.
Embedded methods elegantly solve feature selection and parameter estimation simultaneously within a single optimization framework.
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
def lasso_feature_selection(X, y):
"""Use Lasso regularization for feature selection"""
# Find optimal alpha using cross-validation
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=1000)
lasso_cv.fit(X, y)
print("Lasso Feature Selection:")
print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
# Show coefficients
feature_importance = pd.DataFrame({
'feature': range(len(lasso_cv.coef_)),
'coefficient': lasso_cv.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("\nFeature coefficients (sorted by absolute value):")
print(feature_importance)
# Select features with non-zero coefficients
selector = SelectFromModel(lasso_cv, prefit=True)
X_selected = selector.transform(X)
selected_features = selector.get_support(indices=True)
print(f"\nSelected {len(selected_features)} features with non-zero coefficients")
print(f"Selected feature indices: {selected_features}")
# Visualize coefficients
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.bar(range(len(lasso_cv.coef_)), lasso_cv.coef_)
plt.xlabel('Feature Index')
plt.ylabel('Coefficient Value')
plt.title('Lasso Coefficients')
plt.axhline(y=0, color='red', linestyle='--', alpha=0.5)
plt.subplot(1, 2, 2)
selected_coef = lasso_cv.coef_[selected_features]
plt.barh(range(len(selected_coef)), selected_coef)
plt.xlabel('Selected Feature Index')
plt.ylabel('Coefficient Value')
plt.title('Selected Features Coefficients')
plt.tight_layout()
plt.show()
return X_selected, selected_features, lasso_cv.coef_
def lasso_path_visualization(X, y):
"""Visualize how coefficients change with regularization strength"""
from sklearn.linear_model import lasso_path
alphas, coefs, _ = lasso_path(X, y, max_iter=1000)
plt.figure(figsize=(12, 8))
plt.plot(alphas, coefs.T)
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Path - How Coefficients Change with Regularization')
plt.xscale('log')
plt.grid(True, alpha=0.3)
# Add vertical line for optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=1000)
lasso_cv.fit(X, y)
plt.axvline(x=lasso_cv.alpha_, color='red', linestyle='--',
label=f'Optimal α = {lasso_cv.alpha_:.4f}')
plt.legend()
plt.show()
return alphas, coefs
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
def tree_based_feature_selection(X, y, method='random_forest'):
"""Use tree-based methods for feature importance"""
if method == 'random_forest':
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
name = "Random Forest"
elif method == 'extra_trees':
estimator = ExtraTreesClassifier(n_estimators=100, random_state=42)
name = "Extra Trees"
else:
estimator = DecisionTreeClassifier(random_state=42)
name = "Decision Tree"
# Fit and get feature importances
estimator.fit(X, y)
importances = estimator.feature_importances_
# Create importance DataFrame
importance_df = pd.DataFrame({
'feature': range(len(importances)),
'importance': importances
}).sort_values('importance', ascending=False)
print(f"{name} Feature Importance:")
print(importance_df)
# Select top features
selector = SelectFromModel(estimator, prefit=True)
X_selected = selector.transform(X)
selected_features = selector.get_support(indices=True)
print(f"\nSelected {len(selected_features)} important features")
# Visualize importance
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.bar(range(len(importances)), importances)
plt.xlabel('Feature Index')
plt.ylabel('Importance')
plt.title(f'{name} - All Features')
plt.subplot(1, 2, 2)
top_features = importance_df.head(10)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)),
[f'Feature {i}' for i in top_features['feature']])
plt.xlabel('Importance')
plt.title(f'{name} - Top 10 Features')
plt.tight_layout()
plt.show()
return X_selected, selected_features, importances
def feature_importance_comparison(X, y):
"""Compare feature importance across different methods"""
methods = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42)
}
importance_comparison = pd.DataFrame(index=range(X.shape[1]))
for name, estimator in methods.items():
estimator.fit(X, y)
importance_comparison[name] = estimator.feature_importances_
print("Feature Importance Comparison:")
print(importance_comparison.round(4))
# Plot comparison
plt.figure(figsize=(12, 8))
importance_comparison.plot(kind='bar', ax=plt.gca())
plt.xlabel('Feature Index')
plt.ylabel('Importance')
plt.title('Feature Importance Comparison Across Methods')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
return importance_comparison
Feature extraction creates new features from existing ones, often reducing dimensionality while preserving important information. Unlike feature selection, which chooses from existing features, extraction transforms the original features into a new feature space.
PCA is fundamentally an eigenvalue problem that finds the optimal linear transformation for dimensionality reduction. It solves a constrained optimization problem to find directions of maximum variance in the data.
Mathematical Foundation: The Optimization Problem
PCA seeks to find orthonormal directions w₁, w₂, …, w_k that maximize the variance of the projected data:
max_{w₁,…,w_k} Σᵢ Var(Xwᵢ) subject to ||wᵢ|| = 1, wᵢᵀwⱼ = 0 for i ≠ j
This is equivalent to finding eigenvectors of the covariance matrix C = (1/n)XᵀX.
Eigenvalue Decomposition Solution
The solution comes from the spectral theorem. For symmetric matrix C:
C = QΛQᵀ
Where: - Q contains eigenvectors (principal components)
- Λ contains eigenvalues (variance explained by each component)
- Components are ordered by decreasing eigenvalue: λ₁ ≥ λ₂ ≥ … ≥ λ_d
Variance Preservation
The k-dimensional PCA projection preserves the following fraction of the total variance:
Variance Ratio = (λ₁ + λ₂ + … + λ_k) / (λ₁ + λ₂ + … + λ_d)
Optimality Properties
PCA is optimal in several senses: 1. Maximum variance: Maximizes variance of projected data
2. Minimum reconstruction error: Minimizes ||X - X̂||²_F among all rank-k approximations
3. Maximum likelihood: Under Gaussian assumptions, PCA is the ML solution
The mathematical elegance of PCA lies in connecting variance maximization, error minimization, and eigenvalue decomposition into a unified framework.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris, load_digits
import matplotlib.pyplot as plt
def pca_mathematical_explanation():
"""Explain PCA step by step mathematically"""
# Create simple 2D example
np.random.seed(42)
# Generate correlated data
mean = [0, 0]
cov = [[3, 2], [2, 2]] # Covariance matrix
data = np.random.multivariate_normal(mean, cov, 200)
print("PCA Step-by-Step:")
print("=" * 50)
# Step 1: Center the data
data_centered = data - np.mean(data, axis=0)
print(f"Step 1 - Data centered: Mean = {np.mean(data_centered, axis=0)}")
# Step 2: Compute covariance matrix
cov_matrix = np.cov(data_centered.T)
print(f"Step 2 - Covariance matrix:\n{cov_matrix}")
# Step 3: Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort by eigenvalue (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"Step 3 - Eigenvalues: {eigenvalues}")
print(f"Step 3 - Eigenvectors:\n{eigenvectors}")
# Step 4: Transform data
pca_data = data_centered.dot(eigenvectors)
print(f"Step 4 - Explained variance ratio: {eigenvalues / np.sum(eigenvalues)}")
# Visualize
plt.figure(figsize=(15, 5))
# Original data
plt.subplot(1, 3, 1)
plt.scatter(data[:, 0], data[:, 1], alpha=0.6)
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
# Centered data with principal components
plt.subplot(1, 3, 2)
plt.scatter(data_centered[:, 0], data_centered[:, 1], alpha=0.6)
# Draw principal components
origin = np.array([0, 0])
plt.quiver(*origin, eigenvectors[0, 0], eigenvectors[1, 0],
scale=1, scale_units='xy', angles='xy', color='red', width=0.005, label='PC1')
plt.quiver(*origin, eigenvectors[0, 1], eigenvectors[1, 1],
scale=1, scale_units='xy', angles='xy', color='blue', width=0.005, label='PC2')
plt.title('Centered Data with Principal Components')
plt.xlabel('Feature 1 (centered)')
plt.ylabel('Feature 2 (centered)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
# Transformed data
plt.subplot(1, 3, 3)
plt.scatter(pca_data[:, 0], pca_data[:, 1], alpha=0.6)
plt.title('Data in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return data, pca_data, eigenvalues, eigenvectors
def pca_iris_example():
"""Comprehensive PCA example with Iris dataset"""
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
print("PCA on Iris Dataset:")
print("=" * 30)
print(f"Original shape: {X.shape}")
print(f"Features: {iris.feature_names}")
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X)
# Analyze components
print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {np.cumsum(pca.explained_variance_ratio_)}")
# Component interpretation
components_df = pd.DataFrame(
pca.components_.T,
columns=[f'PC{i+1}' for i in range(len(pca.components_))],
index=iris.feature_names
)
print(f"\nPrincipal Components (loadings):")
print(components_df.round(3))
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Scree plot
axes[0, 0].bar(range(1, len(pca.explained_variance_ratio_) + 1),
pca.explained_variance_ratio_)
axes[0, 0].plot(range(1, len(pca.explained_variance_ratio_) + 1),
np.cumsum(pca.explained_variance_ratio_), 'ro-')
axes[0, 0].set_xlabel('Principal Component')
axes[0, 0].set_ylabel('Explained Variance Ratio')
axes[0, 0].set_title('Scree Plot')
axes[0, 0].legend(['Individual', 'Cumulative'])
axes[0, 0].grid(True, alpha=0.3)
# PC1 vs PC2
scatter = axes[0, 1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
axes[0, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0, 1].set_title('First Two Principal Components')
plt.colorbar(scatter, ax=axes[0, 1])
# PC1 vs PC3
scatter = axes[0, 2].scatter(X_pca[:, 0], X_pca[:, 2], c=y, cmap='viridis')
axes[0, 2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0, 2].set_ylabel(f'PC3 ({pca.explained_variance_ratio_[2]:.1%} variance)')
axes[0, 2].set_title('PC1 vs PC3')
plt.colorbar(scatter, ax=axes[0, 2])
# Component loadings heatmap
im = axes[1, 0].imshow(pca.components_, cmap='RdBu', aspect='auto')
axes[1, 0].set_xticks(range(len(iris.feature_names)))
axes[1, 0].set_xticklabels(iris.feature_names, rotation=45)
axes[1, 0].set_yticks(range(len(pca.components_)))
axes[1, 0].set_yticklabels([f'PC{i+1}' for i in range(len(pca.components_))])
axes[1, 0].set_title('Component Loadings Heatmap')
plt.colorbar(im, ax=axes[1, 0])
# Individual feature contributions to PC1
pc1_contributions = np.abs(pca.components_[0])
axes[1, 1].barh(iris.feature_names, pc1_contributions)
axes[1, 1].set_xlabel('Absolute Loading')
axes[1, 1].set_title('Feature Contributions to PC1')
# 3D plot if possible
if len(pca.components_) >= 3:
axes[1, 2].remove()  # replace the unused 2D axes with a 3D one
ax = fig.add_subplot(2, 3, 6, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('3D PCA Visualization')
plt.tight_layout()
plt.show()
return X_pca, pca
def pca_dimensionality_reduction_analysis():
"""Analyze how much dimensionality reduction we can achieve"""
# Use digits dataset for high-dimensional example
digits = load_digits()
X = digits.data # 64 features (8x8 pixel images)
y = digits.target
print("Dimensionality Reduction Analysis:")
print("=" * 40)
print(f"Original dimensions: {X.shape}")
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X)
# Find number of components for different variance thresholds
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
thresholds = [0.8, 0.9, 0.95, 0.99]
for threshold in thresholds:
n_components = np.argmax(cumsum_var >= threshold) + 1
compression_ratio = (X.shape[1] - n_components) / X.shape[1] * 100
print(f"{threshold:.0%} variance: {n_components} components "
f"({compression_ratio:.1f}% reduction)")
# Visualize
plt.figure(figsize=(15, 5))
# Cumulative explained variance
plt.subplot(1, 3, 1)
plt.plot(range(1, len(cumsum_var) + 1), cumsum_var, 'b-')
for threshold in thresholds:
n_comp = np.argmax(cumsum_var >= threshold) + 1
plt.axhline(y=threshold, color='red', linestyle='--', alpha=0.7)
plt.axvline(x=n_comp, color='red', linestyle='--', alpha=0.7)
plt.plot(n_comp, threshold, 'ro')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.grid(True, alpha=0.3)
# First few original images
plt.subplot(1, 3, 2)
sample_images = X[:16].reshape(16, 8, 8)
combined_image = np.zeros((4*8, 4*8))
for i in range(4):
for j in range(4):
combined_image[i*8:(i+1)*8, j*8:(j+1)*8] = sample_images[i*4 + j]
plt.imshow(combined_image, cmap='gray')
plt.title('Original Images (64D)')
plt.axis('off')
# Reconstructed images using reduced dimensions
plt.subplot(1, 3, 3)
n_components_reduced = np.argmax(cumsum_var >= 0.9) + 1
pca_reduced = PCA(n_components=n_components_reduced)
X_reduced = pca_reduced.fit_transform(X)
X_reconstructed = pca_reduced.inverse_transform(X_reduced)
reconstructed_images = X_reconstructed[:16].reshape(16, 8, 8)
combined_reconstructed = np.zeros((4*8, 4*8))
for i in range(4):
for j in range(4):
combined_reconstructed[i*8:(i+1)*8, j*8:(j+1)*8] = reconstructed_images[i*4 + j]
plt.imshow(combined_reconstructed, cmap='gray')
plt.title(f'Reconstructed ({n_components_reduced}D → 64D)')
plt.axis('off')
plt.tight_layout()
plt.show()
return cumsum_var, n_components_reduced
LDA extends PCA to supervised dimensionality reduction by incorporating class information. Instead of maximizing total variance like PCA, LDA maximizes the ratio of between-class to within-class variance.
Mathematical Objective: Fisher’s Criterion
LDA solves the generalized eigenvalue problem to find projection directions that maximize:
J(w) = (wᵀS_B w) / (wᵀS_W w)
Where: - S_B = between-class scatter matrix = Σᵢ nᵢ(μᵢ - μ)(μᵢ - μ)ᵀ
- S_W = within-class scatter matrix = Σᵢ Σ_{x∈Cᵢ} (x - μᵢ)(x - μᵢ)ᵀ
- nᵢ = number of samples in class i
- μᵢ = mean of class i, μ = overall mean
Generalized Eigenvalue Solution
The optimal projection directions are eigenvectors of S_W⁻¹S_B:
S_W⁻¹S_B w = λw
The eigenvalues λ represent the discriminative power of each direction.
Dimensionality Constraints
LDA can find at most min(d, C-1) meaningful components, where:
- d = original feature dimensionality
- C = number of classes
This limitation arises because S_B has rank at most C-1.
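As a quick sanity check of this rank constraint, here is a minimal sketch using scikit-learn and the Iris dataset (d = 4 features, C = 3 classes):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # d = 4 features, C = 3 classes
lda = LinearDiscriminantAnalysis()     # n_components defaults to min(d, C - 1)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                     # (150, 2): at most C - 1 = 2 components
```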
Optimality Properties
LDA provides supervised dimensionality reduction specifically designed for classification: under the assumption that each class is Gaussian with a shared covariance matrix, the LDA projection preserves all the information needed for the Bayes-optimal classification decision.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
def lda_explanation_and_demo():
"""Explain and demonstrate LDA for supervised dimensionality reduction"""
print("Linear Discriminant Analysis (LDA):")
print("=" * 40)
print("LDA vs PCA:")
print("- PCA: Unsupervised, maximizes variance")
print("- LDA: Supervised, maximizes class separability")
# Load Iris for comparison
iris = load_iris()
X = iris.data
y = iris.target
# Apply both PCA and LDA
pca = PCA(n_components=2)
lda = LinearDiscriminantAnalysis(n_components=2)
X_pca = pca.fit_transform(X)
X_lda = lda.fit_transform(X, y)
# Compare results
plt.figure(figsize=(15, 5))
# Original data (first two features)
plt.subplot(1, 3, 1)
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Original Features')
plt.colorbar(scatter)
# PCA projection
plt.subplot(1, 3, 2)
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)')
plt.title('PCA Projection')
plt.colorbar(scatter)
# LDA projection
plt.subplot(1, 3, 3)
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel(f'LD1 ({lda.explained_variance_ratio_[0]:.1%} var)')
plt.ylabel(f'LD2 ({lda.explained_variance_ratio_[1]:.1%} var)')
plt.title('LDA Projection')
plt.colorbar(scatter)
plt.tight_layout()
plt.show()
# Performance comparison
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
scores_original = cross_val_score(classifier, X, y, cv=5)
scores_pca = cross_val_score(classifier, X_pca, y, cv=5)
scores_lda = cross_val_score(classifier, X_lda, y, cv=5)
print(f"\nClassification Performance:")
print(f"Original features: {scores_original.mean():.3f} ± {scores_original.std():.3f}")
print(f"PCA (2D): {scores_pca.mean():.3f} ± {scores_pca.std():.3f}")
print(f"LDA (2D): {scores_lda.mean():.3f} ± {scores_lda.std():.3f}")
return X_pca, X_lda, pca, lda
def lda_mathematical_insight():
"""Show the mathematical insight behind LDA"""
# Generate synthetic data for clear demonstration
np.random.seed(42)
# Class 1: centered at (1, 1)
class1 = np.random.multivariate_normal([1, 1], [[0.3, 0.1], [0.1, 0.3]], 50)
# Class 2: centered at (3, 2)
class2 = np.random.multivariate_normal([3, 2], [[0.3, -0.1], [-0.1, 0.3]], 50)
# Class 3: centered at (1.5, 3)
class3 = np.random.multivariate_normal([1.5, 3], [[0.4, 0.0], [0.0, 0.2]], 50)
X = np.vstack([class1, class2, class3])
y = np.hstack([np.zeros(50), np.ones(50), np.full(50, 2)])
# Apply LDA
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y)
# Calculate within-class and between-class scatter
def calculate_scatter_matrices(X, y):
"""Calculate within-class and between-class scatter matrices"""
classes = np.unique(y)
n_features = X.shape[1]
# Overall mean
mean_overall = np.mean(X, axis=0)
# Within-class scatter matrix
S_W = np.zeros((n_features, n_features))
# Between-class scatter matrix
S_B = np.zeros((n_features, n_features))
for c in classes:
X_c = X[y == c]
mean_c = np.mean(X_c, axis=0)
n_c = X_c.shape[0]
# Within-class scatter
S_W += np.cov(X_c.T) * (n_c - 1)
# Between-class scatter
mean_diff = (mean_c - mean_overall).reshape(-1, 1)
S_B += n_c * (mean_diff @ mean_diff.T)
return S_W, S_B, mean_overall
S_W, S_B, mean_overall = calculate_scatter_matrices(X, y)
print("LDA Mathematical Components:")
print("=" * 35)
print(f"Within-class scatter matrix S_W:\n{S_W.round(3)}")
print(f"\nBetween-class scatter matrix S_B:\n{S_B.round(3)}")
# LDA seeks to maximize: (w^T * S_B * w) / (w^T * S_W * w)
# This is solved by finding eigenvectors of S_W^(-1) * S_B
try:
eigenvals, eigenvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
idx = eigenvals.argsort()[::-1]
eigenvals = eigenvals[idx]
eigenvecs = eigenvecs[:, idx]
print(f"\nEigenvalues: {eigenvals.real.round(3)}")
print(f"LDA directions (eigenvectors):\n{eigenvecs.real.round(3)}")
except np.linalg.LinAlgError:
print("\nSingular matrix encountered - using pseudoinverse")
# Visualize the separability
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
colors = ['red', 'blue', 'green']
for i, color in enumerate(colors):
mask = y == i
plt.scatter(X[mask, 0], X[mask, 1], c=color, alpha=0.7, label=f'Class {i}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original 2D Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
for i, color in enumerate(colors):
mask = y == i
plt.scatter(X_lda[mask, 0], X_lda[mask, 1], c=color, alpha=0.7, label=f'Class {i}')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')
plt.title('LDA Projection - Maximized Separability')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return X_lda, S_W, S_B
Mutual information provides a principled, information-theoretic approach to measuring feature relevance. Unlike correlation, which only captures linear relationships, mutual information detects any statistical dependency between variables.
Information-Theoretic Foundation
Mutual information quantifies the reduction in uncertainty about Y when X is observed:
I(X;Y) = H(Y) - H(Y|X)
Where: - H(Y) = -Σᵧ p(y) log p(y) is the entropy of Y (uncertainty before observing X)
- H(Y|X) = -Σₓ,ᵧ p(x,y) log p(y|x) is the conditional entropy (uncertainty after observing X)
Alternative Formulation: Kullback-Leibler Divergence
Mutual information can be expressed as the KL divergence between joint and product distributions:
I(X;Y) = KL(P(X,Y) || P(X)P(Y)) = Σₓ,ᵧ p(x,y) log [p(x,y) / (p(x)p(y))]
This formulation highlights that MI measures how much the joint distribution deviates from independence.
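To make the KL formulation concrete, here is a minimal sketch that computes I(X;Y) directly from the definition; the 2×2 joint distribution is made up for illustration:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) for two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(X;Y) = sum over x,y of p(x,y) * log[ p(x,y) / (p(x)p(y)) ]
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X;Y) = {mi:.4f} nats")       # > 0, since X and Y are dependent
```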
Key Properties
Mutual information is non-negative (I(X;Y) ≥ 0), symmetric (I(X;Y) = I(Y;X)), and equals zero if and only if X and Y are statistically independent.
Continuous Variable Extension
For continuous variables, MI uses differential entropy:
I(X;Y) = ∫∫ p(x,y) log [p(x,y) / (p(x)p(y))] dx dy
In practice, this requires density estimation or discretization techniques.
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Calculate mutual information for classification
mi_scores = mutual_info_classif(X, y, random_state=42)
# Create feature importance plot
feature_names = data.feature_names
mi_df = pd.DataFrame({
'feature': feature_names,
'mutual_info': mi_scores
}).sort_values('mutual_info', ascending=False)
plt.figure(figsize=(12, 8))
plt.barh(range(len(mi_df.head(15))), mi_df.head(15)['mutual_info'])
plt.yticks(range(len(mi_df.head(15))), mi_df.head(15)['feature'])
plt.xlabel('Mutual Information Score')
plt.title('Top 15 Features by Mutual Information')
plt.tight_layout()
plt.show()
print("Top 10 features by mutual information:")
for i, (feature, score) in enumerate(mi_df.head(10).values):
print(f"{i+1:2d}. {feature:<25}: {score:.4f}")Advantages: - Captures non-linear relationships - Model-agnostic - No assumptions about data distribution
Limitations: - Computationally expensive for large datasets - Sensitive to discretization for continuous variables - May not capture complex interactions
Analysis of Variance (ANOVA) provides a statistical framework for testing whether features show significant differences across groups, making it valuable for feature selection in classification problems.
Mathematical Foundation: F-Statistic
The ANOVA F-test compares between-group variance to within-group variance:
F = MSB / MSW = (SSB/(k-1)) / (SSW/(N-k))
Where: - SSB = Sum of Squares Between groups = Σᵢ nᵢ(x̄ᵢ - x̄)²
- SSW = Sum of Squares Within groups = Σᵢ Σⱼ (xᵢⱼ - x̄ᵢ)²
- k = number of groups, N = total sample size
Statistical Interpretation
Under the null hypothesis H₀: μ₁ = μ₂ = … = μ_k (all group means equal), F follows an F-distribution with (k-1, N-k) degrees of freedom. Large F-values indicate significant group differences, suggesting the feature is informative for classification.
Feature Selection via ANOVA
Features with F-statistic exceeding critical value F_α are selected: F_computed > F_α(k-1, N-k)
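In scikit-learn, this test is exposed through f_classif; the following minimal sketch selects the top features of the Iris dataset by F-statistic (the choice of k=2 here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the k features with the largest ANOVA F-statistics
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("F-statistics:", selector.scores_.round(2))
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)
```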
RFE implements a greedy backward elimination algorithm that iteratively removes the least important features according to a base estimator.
Algorithm: Backward Greedy Search
Starting from the full feature set, RFE repeatedly (1) fits the base estimator, (2) ranks features by the estimator’s importance scores (coefficients or feature importances), and (3) removes the least important feature(s), until the desired number of features remains.
RFE with Cross-Validation (RFECV)
RFECV extends RFE by using cross-validation to determine optimal feature count:
n* = arg max_{n∈{1,2,…,p}} CV_score(RFE_n(F))
This addresses RFE’s limitation of requiring pre-specified target dimensionality.
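A minimal RFECV sketch; the dataset and base estimator (logistic regression on the Breast Cancer data) are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Cross-validation picks the optimal number of features automatically
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=5000),
    step=1,               # remove one feature per iteration
    cv=5,
    scoring='accuracy'
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print("Selected feature indices:", rfecv.get_support(indices=True))
```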
Theoretical Properties
As a greedy procedure, RFE is not guaranteed to find the globally optimal feature subset, but it requires only on the order of p model fits (with step=1) instead of an exhaustive search over 2ᵖ subsets, and it works well when the base estimator produces stable importance rankings.
Tree-based models provide natural feature importance measures through impurity reduction calculations.
Gini Importance
For random forests, feature importance is the average decrease in Gini impurity:
Importance(f_j) = (1/B) Σ_b Σ_{t∈T_b} p(t) Δi(t, f_j)
Where: - B = number of trees, T_b = nodes in tree b
- p(t) = proportion of samples reaching node t
- Δi(t, f_j) = impurity decrease when splitting on feature f_j at node t
SHAP values provide a unified framework for interpreting model predictions by quantifying the contribution of each feature to the prediction.
SHAP values are based on cooperative game theory. For a prediction f(x), the SHAP value ϕi for feature i is:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{i\}) - f(S)]$$
Where: - N is the set of all features - S is a subset of features not including i - f(S) is the model prediction using only features in subset S
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train a model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Calculate SHAP values
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test)
# Note: for binary classification, older SHAP versions return a list of two
# arrays (one per class); shap_values[1] below refers to the positive class.
# Newer SHAP versions may return a single 3D array instead.
# Feature importance plot
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names,
plot_type="bar", show=False)
plt.title('Feature Importance (SHAP Values)')
plt.tight_layout()
plt.show()
# Detailed SHAP summary plot
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names, show=False)
plt.title('SHAP Summary Plot - Feature Impact on Predictions')
plt.tight_layout()
plt.show()
# Calculate mean absolute SHAP values for feature ranking
mean_shap = np.abs(shap_values[1]).mean(axis=0)
shap_df = pd.DataFrame({
'feature': feature_names,
'mean_shap': mean_shap
}).sort_values('mean_shap', ascending=False)
print("Top 10 features by SHAP importance:")
for i, (feature, importance) in enumerate(shap_df.head(10).values):
print(f"{i+1:2d}. {feature:<25}: {importance:.4f}")# Waterfall plot for a single prediction
shap.waterfall_plot(
explainer.expected_value[1],
shap_values[1][0],
X_test[0],
feature_names=feature_names,
show=False
)
plt.title('SHAP Waterfall Plot - Individual Prediction Explanation')
plt.tight_layout()
plt.show()
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
import seaborn as sns
# Calculate different types of feature importance
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Tree-based importance
tree_importance = rf_model.feature_importances_
# Statistical test (F-score)
f_scores, _ = f_classif(X_train, y_train)
f_scores_norm = f_scores / f_scores.max()
# Mutual information (already calculated)
mi_scores_norm = mi_scores / mi_scores.max()
# SHAP importance (already calculated)
shap_importance_norm = mean_shap / mean_shap.max()
# Create comparison dataframe
comparison_df = pd.DataFrame({
'feature': feature_names,
'tree_importance': tree_importance,
'f_score': f_scores_norm,
'mutual_info': mi_scores_norm,
'shap_importance': shap_importance_norm
})
# Plot comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
methods = ['tree_importance', 'f_score', 'mutual_info', 'shap_importance']
titles = ['Tree-based Importance', 'F-Score', 'Mutual Information', 'SHAP Importance']
for idx, (method, title) in enumerate(zip(methods, titles)):
ax = axes[idx//2, idx%2]
top_features = comparison_df.nlargest(15, method)
ax.barh(range(len(top_features)), top_features[method])
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'], fontsize=8)
ax.set_xlabel('Normalized Importance')
ax.set_title(title)
plt.tight_layout()
plt.show()
# Correlation between different importance measures
correlation_matrix = comparison_df[methods].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Between Different Feature Importance Methods')
plt.tight_layout()
plt.show()
Let’s apply comprehensive feature engineering to a customer churn prediction problem.
# Simulate customer churn dataset
np.random.seed(42)
n_customers = 1000
# Generate synthetic customer data
customer_data = {
'tenure': np.random.exponential(24, n_customers),
'monthly_charges': np.random.normal(65, 20, n_customers),
'total_charges': np.random.normal(2000, 800, n_customers),
'age': np.clip(np.random.normal(40, 15, n_customers), 18, 90),  # clip to a realistic range
'contract_length': np.random.choice([1, 12, 24], n_customers),
'payment_method': np.random.choice(['credit_card', 'bank_transfer', 'electronic_check'], n_customers),
'service_calls': np.random.poisson(2, n_customers),
'data_usage_gb': np.random.exponential(15, n_customers)
}
# Create target variable (churn) with realistic relationships
churn_prob = (
0.1 +
0.3 * (customer_data['service_calls'] > 5).astype(int) +
0.2 * (customer_data['tenure'] < 6).astype(int) +
0.15 * (customer_data['monthly_charges'] > 80).astype(int)
)
churn = np.random.binomial(1, np.clip(churn_prob, 0, 1), n_customers)
# Create DataFrame
df_churn = pd.DataFrame(customer_data)
df_churn['churn'] = churn
print("Dataset shape:", df_churn.shape)
print("\nChurn distribution:")
print(df_churn['churn'].value_counts(normalize=True))
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Define feature engineering steps
def create_churn_features(df):
"""Create new churn-related features from the raw customer data"""
df = df.copy()
# Ratio features (small constants guard against division by near-zero values)
df['avg_monthly_usage'] = df['data_usage_gb'] / (df['tenure'] + 0.1)
df['charges_per_gb'] = df['monthly_charges'] / (df['data_usage_gb'] + 0.1)
# Indicator features
df['high_service_calls'] = (df['service_calls'] > 3).astype(int)
df['new_customer'] = (df['tenure'] < 12).astype(int)
# Binning
df['age_group'] = pd.cut(df['age'],
bins=[0, 25, 35, 50, 100],
labels=['young', 'adult', 'middle_aged', 'senior'])
return df
# Apply feature engineering
df_engineered = create_churn_features(df_churn)
print("New features created:")
new_features = ['avg_monthly_usage', 'charges_per_gb', 'high_service_calls', 'new_customer', 'age_group']
for feature in new_features:
print(f"- {feature}")from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Prepare features for modeling
X = df_engineered.drop(['churn'], axis=1)
y = df_engineered['churn']
# Encode categorical variables
le_payment = LabelEncoder()
le_age_group = LabelEncoder()
X_encoded = X.copy()
X_encoded['payment_method'] = le_payment.fit_transform(X['payment_method'])
X_encoded['age_group'] = le_age_group.fit_transform(X['age_group'])
# Scale numerical features
scaler = StandardScaler()
numerical_cols = ['tenure', 'monthly_charges', 'total_charges', 'age', 'data_usage_gb',
'avg_monthly_usage', 'charges_per_gb']
X_encoded[numerical_cols] = scaler.fit_transform(X_encoded[numerical_cols])
# Compare models with and without feature engineering
X_original = df_churn[['tenure', 'monthly_charges', 'total_charges', 'age', 'service_calls', 'data_usage_gb']]
X_original_scaled = scaler.fit_transform(X_original)
# Train models
models = {
'Random Forest (Original)': RandomForestClassifier(random_state=42),
'Random Forest (Engineered)': RandomForestClassifier(random_state=42),
'Logistic Regression (Original)': LogisticRegression(random_state=42),
'Logistic Regression (Engineered)': LogisticRegression(random_state=42)
}
datasets = {
'Random Forest (Original)': X_original_scaled,
'Random Forest (Engineered)': X_encoded,
'Logistic Regression (Original)': X_original_scaled,
'Logistic Regression (Engineered)': X_encoded
}
print("Model Performance Comparison:")
print("-" * 50)
for model_name, model in models.items():
X_data = datasets[model_name]
scores = cross_val_score(model, X_data, y, cv=5, scoring='accuracy')
print(f"{model_name:<30}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")# Example: Proper way to handle feature engineering in cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
# WRONG: Fitting scaler on entire dataset
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X) # Data leakage!
# scores = cross_val_score(model, X_scaled, y, cv=5)
# CORRECT: Using pipeline to prevent data leakage
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(f_classif, k=10)),
('classifier', LogisticRegression())
])
scores = cross_val_score(pipeline, X_encoded, y, cv=5)  # use the numerically encoded features
print(f"Cross-validation accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")1. Information-Theoretic Foundation: Feature engineering maximizes mutual information I(X;Y) between features and targets while minimizing redundancy, providing a principled approach to representation learning.
2. Mathematical Necessity of Scaling: Feature scaling addresses fundamental issues in optimization and distance computation, preventing scale-dependent biases and improving convergence properties of learning algorithms.
3. Statistical Learning Theory: The curse of dimensionality shows that generalization bounds worsen with irrelevant features, making feature selection mathematically essential for good performance.
4. Optimization Perspectives on Selection Methods: - Filter methods: Efficient univariate statistical tests (O(p) complexity)
- Wrapper methods: Exponential search problem solved with greedy heuristics
- Embedded methods: Sparsity-inducing regularization integrates selection into learning
5. Linear Algebra Foundations of Extraction: - PCA: Eigenvalue decomposition optimizing variance preservation - LDA: Generalized eigenvalue problem optimizing class separability - Both provide mathematically optimal solutions to their respective objectives
6. Statistical Validation: All feature engineering must respect train-test independence to maintain valid generalization estimates and avoid optimistic bias in performance evaluation.
Load the Wine dataset and compare the performance of different scaling methods on a logistic regression classifier. Use cross-validation to get robust estimates.
Create a complete feature selection pipeline for the Breast Cancer dataset that: 1. Applies univariate selection (SelectKBest) 2. Uses recursive feature elimination 3. Compares results with tree-based feature importance 4. Evaluates the impact on model performance
Perform PCA on the Digits dataset and: 1. Plot the explained variance ratio 2. Determine how many components explain 95% of variance 3. Visualize the first two principal components 4. Compare classification accuracy with different numbers of components
Using the Boston Housing dataset: 1. Create polynomial features of degree 2 2. Apply different scaling methods 3. Use mutual information for feature selection 4. Compare model performance before and after feature engineering
Choose a dataset from your domain of interest and: 1. Perform comprehensive exploratory data analysis 2. Apply appropriate feature engineering techniques 3. Use multiple feature selection methods 4. Document your process and justify your choices 5. Evaluate the impact on model performance
This completes Chapter 3: Feature Engineering. The next chapter will cover Classification Algorithms, building on the feature engineering techniques learned here.
# Chapter 4: Classification Algorithms
“The goal is to turn data into information, and information into insight.”
– Carly Fiorina
By the end of this chapter, you will be able to: - Understand the fundamentals of classification algorithms - Implement decision trees, KNN, SVM, and logistic regression - Evaluate classification model performance using appropriate metrics - Apply feature engineering techniques for classification problems - Compare different algorithms and select the best for specific problems - Build end-to-end classification pipelines
Classification represents one of the fundamental problems in statistical learning theory, where we seek to learn a mapping from input features to discrete output categories. The mathematical foundations draw from probability theory, decision theory, and statistical inference.
The Classification Learning Problem
Given a training dataset D = {(x₁, y₁), (x₂, y₂), …, (x_n, y_n)}, where xᵢ ∈ ℝᵈ are feature vectors and yᵢ ∈ {1, 2, …, K} are class labels, we want to learn a function:
f: ℝᵈ → {1, 2, …, K}
That minimizes the expected prediction error on future, unseen data.
Bayes Optimal Classifier
The theoretically optimal classifier is the Bayes classifier, which assigns each point to the class with highest posterior probability:
f*(x) = arg max_k P(Y = k | X = x)
Using Bayes’ theorem: P(Y = k | X = x) = P(X = x | Y = k) × P(Y = k) / P(X = x)
The Bayes error rate represents the lowest achievable error rate for any classifier:
L* = 1 - E[max_k P(Y = k | X = x)]
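To make the Bayes classifier concrete, here is a minimal sketch with made-up one-dimensional class-conditional Gaussians and priors; it evaluates the Bayes-optimal decision rule at a few query points:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D problem: two classes with known distributions
priors = {0: 0.6, 1: 0.4}                    # P(Y = k)
likelihoods = {0: norm(loc=0.0, scale=1.0),  # P(X | Y = 0)
               1: norm(loc=2.0, scale=1.0)}  # P(X | Y = 1)

def bayes_classify(x):
    """Return the class with the highest posterior P(Y = k | X = x)."""
    # By Bayes' theorem, the posterior is proportional to likelihood * prior
    scores = {k: likelihoods[k].pdf(x) * priors[k] for k in priors}
    return max(scores, key=scores.get)

for x in [-1.0, 1.0, 1.5, 3.0]:
    print(f"x = {x:4.1f} -> Bayes-optimal class: {bayes_classify(x)}")
```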
Decision Boundaries and Complexity
Different classification algorithms make different assumptions about the decision boundary: - Linear classifiers: Assume linear decision boundaries - Nonlinear classifiers: Can learn complex, curved boundaries - Non-parametric methods: Make minimal distributional assumptions
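The sketch below contrasts a linear and a nonlinear classifier on data with a curved class boundary; the specific estimators (logistic regression vs. an RBF-kernel SVM) are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=42)

linear_clf = LogisticRegression().fit(X_moons, y_moons)  # linear boundary
rbf_clf = SVC(kernel='rbf').fit(X_moons, y_moons)        # curved boundary

print(f"Linear model accuracy:     {linear_clf.score(X_moons, y_moons):.3f}")
print(f"RBF-kernel model accuracy: {rbf_clf.score(X_moons, y_moons):.3f}")
```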
Classification is a supervised learning task where we predict discrete class labels rather than continuous values, requiring specialized algorithms and evaluation metrics.
Predicting one of two possible outcomes: - Email Spam Detection: Spam or Not Spam
- Medical Diagnosis: Disease Present or Absent
- Credit Approval: Approved or Rejected
Predicting one of multiple possible classes: - Image Recognition: Cat, Dog, Bird, etc. - Text Classification: Sports, Politics, Technology, etc. - Product Categorization: Electronics, Clothing, Books, etc.
Predicting multiple labels simultaneously: - Movie Genre: Action AND Comedy AND Drama - Medical Symptoms: Multiple conditions present - Document Tags: Multiple relevant topics
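One common strategy for multi-label problems is to train one binary classifier per label; here is a minimal sketch using scikit-learn’s synthetic multi-label generator (all parameter values are illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Each sample may carry several of 3 labels simultaneously
X_ml, Y_ml = make_multilabel_classification(n_samples=200, n_features=10,
                                            n_classes=3, random_state=42)

# One binary logistic regression per label
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_ml, Y_ml)

print("Label indicator matrix shape:", Y_ml.shape)  # (200, 3)
print("Predicted labels for first sample:", clf.predict(X_ml[:1]))
```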
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression
# Generate sample data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Classification example
X_class, y_class = make_classification(n_samples=200, n_features=2,
n_redundant=0, n_informative=2,
n_clusters_per_class=1, random_state=42)
ax1.scatter(X_class[:, 0], X_class[:, 1], c=y_class, cmap='viridis')
ax1.set_title('Classification Problem\n(Discrete Classes)')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
# Regression example
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)
ax2.scatter(X_reg, y_reg, alpha=0.6)
ax2.set_title('Regression Problem\n(Continuous Values)')
ax2.set_xlabel('Feature')
ax2.set_ylabel('Target Value')
plt.tight_layout()
plt.show()
print("Classification Output: Discrete classes (0, 1, 2, ...)")
print("Regression Output: Continuous values (1.5, 2.7, 10.3, ...)")Decision trees represent one of the most interpretable machine learning algorithms, grounded in information theory and recursive optimization. They construct hierarchical decision rules that partition the feature space into regions of high class purity.
Mathematical Foundation: Recursive Binary Partitioning
A decision tree recursively partitions the feature space X ⊆ ℝᵈ into disjoint regions R₁, R₂, …, R_M such that:
X = ∪ᵢ₌₁ᴹ Rᵢ and Rᵢ ∩ Rⱼ = ∅ for i ≠ j
Each region Rᵢ is associated with a class prediction ŷᵢ, typically the majority class within that region.
The Greedy Splitting Algorithm
At each node, the algorithm chooses the split that maximally reduces impurity:
(j, s) = arg max_{j,s} [N_parent × I(parent) - N_left × I(left) - N_right × I(right)]
Where: - j is the feature index, s is the split threshold - I(·) is an impurity measure (entropy, Gini, etc.) - N represents the number of samples in each node
Information-Theoretic Splitting Criteria
The choice of impurity measure determines the tree’s behavior: - Entropy: H(S) = -Σᵢ pᵢ log₂(pᵢ)
- Gini impurity: Gini(S) = 1 - Σᵢ pᵢ²
- Misclassification error: 1 - maxᵢ pᵢ
Each criterion represents a different way of measuring node “purity” or class homogeneity.
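A minimal sketch comparing these impurity measures on a few hypothetical class distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; a pure (single-class) node has zero entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) if p.size > 1 else 0.0

def gini(p):
    return 1 - np.sum(np.asarray(p) ** 2)

def misclassification(p):
    return 1 - np.max(p)

for p in [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]:
    print(f"p = {p}: entropy = {entropy(p):.3f}, "
          f"gini = {gini(p):.3f}, misclassification = {misclassification(p):.3f}")
```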
Decision trees construct hierarchical rules through recursive feature space partitioning, choosing splits that maximize information gain or minimize impurity measures.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Create a simple tennis dataset
tennis_data = {
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy',
'Overcast', 'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast',
'Overcast', 'Rainy'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Windy': ['False', 'True', 'False', 'False', 'False', 'True', 'True',
'False', 'False', 'False', 'True', 'True', 'False', 'True'],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df_tennis = pd.DataFrame(tennis_data)
print("Tennis Dataset:")
print(df_tennis.head(10))
# Encode categorical variables
from sklearn.preprocessing import LabelEncoder
# Create encoders for each categorical column
encoders = {}
df_encoded = df_tennis.copy()
for column in df_tennis.columns:
if df_tennis[column].dtype == 'object':
encoders[column] = LabelEncoder()
df_encoded[column] = encoders[column].fit_transform(df_tennis[column])
X = df_encoded.drop('Play', axis=1)
y = df_encoded['Play']
print("\nEncoded Dataset:")
print(df_encoded.head())
Decision tree splitting criteria are grounded in information theory and statistical measures of uncertainty. Understanding these mathematical foundations is crucial for algorithm selection and hyperparameter tuning.
Entropy: Measuring Information Content
Entropy H(S) quantifies the expected information content (uncertainty) in a dataset S:
H(S) = -Σᵢ₌₁ᶜ pᵢ log₂(pᵢ)
Mathematical Properties: - Maximum entropy: H(S) = log₂(c) when all classes are equally likely
- Minimum entropy: H(S) = 0 when all samples belong to one class
- Concavity: Entropy is a concave function, ensuring unique maxima
Information Gain: Quantifying Split Quality
Information gain measures the reduction in entropy achieved by a split:
IG(S, A) = H(S) - Σ_{v∈Values(A)} (|Sᵥ|/|S|) × H(Sᵥ)
Where Sᵥ is the subset of S where attribute A has value v.
Gini Impurity: Alternative Measure
Gini impurity provides a computationally efficient alternative:
Gini(S) = 1 - Σᵢ₌₁ᶜ pᵢ²
Mathematical comparison: - Entropy: More theoretically principled (information-theoretic foundation)
- Gini: Computationally faster (no logarithms)
- Both: Concave functions that favor balanced splits
Gain Ratio: Addressing Split Bias
Information gain is biased toward attributes with many values. Gain ratio normalizes by split information:
GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)
Where SplitInfo(S, A) = -Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|)
This prevents overfitting to high-cardinality categorical features.
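A minimal from-scratch sketch of the gain-ratio computation; the labels and the three-valued attribute split are made up for illustration:

```python
import numpy as np
from collections import Counter

def label_entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(y, splits):
    """y: all labels; splits: one list of labels per attribute value."""
    n = len(y)
    weights = [len(s) / n for s in splits]
    # Information gain: parent entropy minus weighted child entropy
    ig = label_entropy(y) - sum(w * label_entropy(s) for w, s in zip(weights, splits))
    # Split information penalizes attributes with many values
    split_info = -sum(w * np.log2(w) for w in weights if w > 0)
    return ig / split_info if split_info > 0 else 0.0

y = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
splits = [['No', 'No'], ['Yes', 'Yes'], ['Yes', 'No']]  # a 3-valued attribute
print(f"Gain ratio: {gain_ratio(y, splits):.3f}")
```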
import numpy as np
from collections import Counter
def calculate_entropy(y):
"""Calculate entropy of a dataset"""
if len(y) == 0:
return 0
# Count occurrences of each class
counts = Counter(y)
total = len(y)
# Calculate entropy
entropy = 0
for count in counts.values():
if count > 0:
probability = count / total
entropy -= probability * np.log2(probability)
return entropy
def calculate_information_gain(X, y, feature_index):
"""Calculate information gain for a specific feature"""
# Calculate entropy of the original dataset
parent_entropy = calculate_entropy(y)
# Get unique values of the feature
feature_values = np.unique(X[:, feature_index])
weighted_entropy = 0
# Calculate weighted entropy after the split
for value in feature_values:
# Create subset where feature equals value
subset_indices = X[:, feature_index] == value
subset_y = y[subset_indices]
# Calculate weight and entropy of subset
weight = len(subset_y) / len(y)
subset_entropy = calculate_entropy(subset_y)
weighted_entropy += weight * subset_entropy
# Information gain = parent entropy - weighted entropy
information_gain = parent_entropy - weighted_entropy
return information_gain
# Calculate information gain for each feature
feature_names = ['Outlook', 'Temperature', 'Humidity', 'Windy']
X_array = X.values
y_array = y.values
print("Information Gain Analysis:")
print("-" * 40)
for i, feature in enumerate(feature_names):
ig = calculate_information_gain(X_array, y_array, i)
print(f"{feature:<12}: {ig:.4f}")
# Calculate base entropy
base_entropy = calculate_entropy(y_array)
print(f"\nBase Entropy: {base_entropy:.4f}")from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Train a decision tree
dt_classifier = DecisionTreeClassifier(
criterion='entropy', # Use information gain
random_state=42,
max_depth=3 # Limit depth to prevent overfitting
)
# Fit the model
dt_classifier.fit(X, y)
# Make predictions
predictions = dt_classifier.predict(X)
accuracy = accuracy_score(y, predictions)
print(f"Training Accuracy: {accuracy:.4f}")
# Visualize the decision tree
plt.figure(figsize=(15, 10))
plot_tree(dt_classifier,
feature_names=feature_names,
class_names=['No', 'Yes'],
filled=True,
rounded=True,
fontsize=12)
plt.title('Decision Tree for Tennis Playing Decision')
plt.show()
# Feature importance
feature_importance = dt_classifier.feature_importances_
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': feature_importance
}).sort_values('Importance', ascending=False)
print("\nFeature Importance:")
print(importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.title('Feature Importance in Decision Tree')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
# Load the Iris dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print("Iris Dataset Information:")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Dataset shape: {X_iris.shape}")
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)
# Train decision tree
dt_iris = DecisionTreeClassifier(
criterion='entropy',
random_state=42,
max_depth=3
)
dt_iris.fit(X_train, y_train)
# Make predictions
y_pred = dt_iris.predict(X_test)
# Evaluate performance
train_accuracy = dt_iris.score(X_train, y_train)
test_accuracy = dt_iris.score(X_test, y_test)
print(f"\nModel Performance:")
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.title('Confusion Matrix - Iris Classification')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Visualize the decision tree
plt.figure(figsize=(20, 12))
plot_tree(dt_iris,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True,
rounded=True,
fontsize=10)
plt.title('Decision Tree for Iris Species Classification')
plt.show()
Decision trees can easily overfit, especially when they grow too deep. Let’s explore this concept:
from sklearn.model_selection import validation_curve
# Test different max_depth values
max_depths = range(1, 21)
train_scores, val_scores = validation_curve(
DecisionTreeClassifier(criterion='entropy', random_state=42),
X_iris, y_iris,
param_name='max_depth',
param_range=max_depths,
cv=5,
scoring='accuracy'
)
# Calculate means and standard deviations
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
# Plot validation curve
plt.figure(figsize=(10, 6))
plt.plot(max_depths, train_mean, 'o-', color='blue', label='Training Accuracy')
plt.fill_between(max_depths, train_mean - train_std, train_mean + train_std,
color='blue', alpha=0.1)
plt.plot(max_depths, val_mean, 'o-', color='red', label='Validation Accuracy')
plt.fill_between(max_depths, val_mean - val_std, val_mean + val_std,
color='red', alpha=0.1)
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve - Decision Tree Max Depth')
plt.legend()
plt.grid(True)
plt.show()
# Find optimal depth
optimal_depth = max_depths[np.argmax(val_mean)]
print(f"Optimal max_depth: {optimal_depth}")
print(f"Best validation accuracy: {val_mean[np.argmax(val_mean)]:.4f}")# Compare different pruning strategies
pruning_params = {
'No Pruning': {},
'Max Depth = 3': {'max_depth': 3},
'Min Samples Split = 20': {'min_samples_split': 20},
'Min Samples Leaf = 5': {'min_samples_leaf': 5},
'Max Features = 2': {'max_features': 2}
}
results = {}
for name, params in pruning_params.items():
# Create and train model
dt = DecisionTreeClassifier(criterion='entropy', random_state=42, **params)
dt.fit(X_train, y_train)
# Evaluate
train_acc = dt.score(X_train, y_train)
test_acc = dt.score(X_test, y_test)
results[name] = {
'Train Accuracy': train_acc,
'Test Accuracy': test_acc,
'Tree Depth': dt.get_depth(),
'Number of Leaves': dt.get_n_leaves()
}
# Display results
results_df = pd.DataFrame(results).T
print("Pruning Strategy Comparison:")
print(results_df.round(4))
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Training vs Test Accuracy
axes[0,0].bar(results_df.index, results_df['Train Accuracy'],
alpha=0.7, label='Train')
axes[0,0].bar(results_df.index, results_df['Test Accuracy'],
alpha=0.7, label='Test')
axes[0,0].set_title('Training vs Test Accuracy')
axes[0,0].set_ylabel('Accuracy')
axes[0,0].legend()
axes[0,0].tick_params(axis='x', rotation=45)
# Tree Depth
axes[0,1].bar(results_df.index, results_df['Tree Depth'])
axes[0,1].set_title('Tree Depth')
axes[0,1].set_ylabel('Depth')
axes[0,1].tick_params(axis='x', rotation=45)
# Number of Leaves
axes[1,0].bar(results_df.index, results_df['Number of Leaves'])
axes[1,0].set_title('Number of Leaves')
axes[1,0].set_ylabel('Leaves')
axes[1,0].tick_params(axis='x', rotation=45)
# Overfitting indicator (Train - Test accuracy)
overfitting = results_df['Train Accuracy'] - results_df['Test Accuracy']
axes[1,1].bar(results_df.index, overfitting)
axes[1,1].set_title('Overfitting Indicator\n(Train - Test Accuracy)')
axes[1,1].set_ylabel('Accuracy Difference')
axes[1,1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
Random Forest improves upon single decision trees by combining multiple trees, reducing overfitting and improving generalization.
For a Random Forest with T trees, the prediction is:
ŷ = mode{h₁(x), h₂(x), …, h_T(x)}
Where h_t(x) is the prediction of the t-th tree.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import time
# Compare single decision tree vs random forest
models = {
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest (10 trees)': RandomForestClassifier(n_estimators=10, random_state=42),
'Random Forest (50 trees)': RandomForestClassifier(n_estimators=50, random_state=42),
'Random Forest (100 trees)': RandomForestClassifier(n_estimators=100, random_state=42)
}
comparison_results = {}
for name, model in models.items():
# Measure training time
start_time = time.time()
# Cross-validation
cv_scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring='accuracy')
# Fit model for additional metrics
model.fit(X_train, y_train)
train_time = time.time() - start_time
# Store results
comparison_results[name] = {
'CV Mean': cv_scores.mean(),
'CV Std': cv_scores.std(),
'Training Time': train_time,
'Train Accuracy': model.score(X_train, y_train),
'Test Accuracy': model.score(X_test, y_test)
}
# Display comparison
comparison_df = pd.DataFrame(comparison_results).T
print("Decision Tree vs Random Forest Comparison:")
print(comparison_df.round(4))
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Cross-validation scores
axes[0].bar(comparison_df.index, comparison_df['CV Mean'])
axes[0].errorbar(range(len(comparison_df)), comparison_df['CV Mean'],
yerr=comparison_df['CV Std'], fmt='none', color='black', capsize=5)
axes[0].set_title('Cross-Validation Accuracy')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
# Training time
axes[1].bar(comparison_df.index, comparison_df['Training Time'])
axes[1].set_title('Training Time')
axes[1].set_ylabel('Time (seconds)')
axes[1].tick_params(axis='x', rotation=45)
# Train vs Test accuracy
x_pos = np.arange(len(comparison_df))
width = 0.35
axes[2].bar(x_pos - width/2, comparison_df['Train Accuracy'],
width, label='Train', alpha=0.7)
axes[2].bar(x_pos + width/2, comparison_df['Test Accuracy'],
width, label='Test', alpha=0.7)
axes[2].set_title('Training vs Test Accuracy')
axes[2].set_ylabel('Accuracy')
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(comparison_df.index, rotation=45)
axes[2].legend()
plt.tight_layout()
plt.show()
# Train a Random Forest and analyze feature importance
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Get feature importance
feature_importance = rf_model.feature_importances_
feature_names = iris.feature_names
# Create importance DataFrame
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': feature_importance
}).sort_values('Importance', ascending=False)
print("Random Forest Feature Importance:")
print(importance_df)
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(importance_df)), importance_df['Importance'])
plt.yticks(range(len(importance_df)), importance_df['Feature'])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance - Iris Dataset')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# Compare individual tree predictions
n_trees_to_show = 5
individual_predictions = []
for i in range(n_trees_to_show):
# Individual trees in the forest predict class indices (as floats) into rf_model.classes_
tree_pred = rf_model.classes_[rf_model.estimators_[i].predict(X_test).astype(int)]
individual_predictions.append(tree_pred)
# Show first few predictions
print(f"\nFirst 10 test samples - Individual tree predictions:")
print("Sample | Tree1 | Tree2 | Tree3 | Tree4 | Tree5 | RF Pred | Actual")
print("-" * 70)
rf_pred = rf_model.predict(X_test)
for i in range(10):
tree_preds = " | ".join([f" {pred[i]} " for pred in individual_predictions])
print(f" {i:2d} | {tree_preds} | {rf_pred[i]} | {y_test[i]}")Advantages: - Easy to understand and interpret - Requires little data preparation - Handles both numerical and categorical data - Can capture non-linear relationships - Feature selection happens automatically
Disadvantages: - Prone to overfitting - Unstable (small data changes can result in different trees) - Biased toward features with more levels - Can create overly complex trees
Advantages: - Reduces overfitting compared to decision trees - More stable and robust - Provides feature importance - Handles missing values well - Works well with default parameters
Disadvantages: - Less interpretable than single trees - Can still overfit with very noisy data - Memory intensive for large datasets - Slower prediction than single trees
K-Nearest Neighbors represents a fundamental non-parametric approach to classification, based on the assumption of local smoothness in the data distribution. It embodies the principle that nearby points in feature space are likely to share the same class label.
Mathematical Foundation: Non-Parametric Density Estimation
KNN implicitly estimates the class conditional densities P(X|Y = k) using a non-parametric approach. For a query point x, the posterior probability is estimated as:
P̂(Y = k | X = x) = (1/K) Σ_{i∈N_K(x)} I(yᵢ = k)
Where: - N_K(x) is the set of K nearest neighbors to x - I(yᵢ = k) is the indicator function (1 if yᵢ = k, 0 otherwise)
The Local Smoothness Assumption
KNN assumes that the target function f: X → Y is locally smooth, meaning:
If ||x₁ - x₂|| is small, then P(Y | X = x₁) ≈ P(Y | X = x₂)
This assumption allows local interpolation to approximate the true posterior probabilities.
Consistency and Convergence Properties
Under mild regularity conditions, KNN is universally consistent:
As n → ∞ and K → ∞ such that K/n → 0, then P̂ → P*
Where P* is the Bayes optimal classifier. This theoretical guarantee makes KNN a valuable baseline algorithm.
KNN is a “lazy learning” algorithm that defers computation until prediction time, storing the entire training dataset rather than learning parameters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a simple 2D dataset for visualization
X_demo, y_demo = make_classification(n_samples=100, n_features=2, n_redundant=0,
n_informative=2, n_clusters_per_class=1,
random_state=42)
# Visualize the concept
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
k_values = [1, 3, 7]
for idx, k in enumerate(k_values):
# Train KNN
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_demo, y_demo)
# Create decision boundary
h = 0.02
x_min, x_max = X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1
y_min, y_max = X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Predict on mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
scatter = axes[idx].scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, cmap='viridis')
axes[idx].set_title(f'KNN with K={k}')
axes[idx].set_xlabel('Feature 1')
axes[idx].set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
The choice of distance metric fundamentally determines the notion of “similarity” in KNN, directly affecting the algorithm’s inductive bias and performance characteristics.
Mathematical Requirements for Distance Metrics
A valid distance metric d(x, y) must satisfy the metric axioms: non-negativity (d(x,y) ≥ 0), identity (d(x,y) = 0 iff x = y), symmetry (d(x,y) = d(y,x)), and the triangle inequality (d(x,z) ≤ d(x,y) + d(y,z)).
Lp Norm Family
Most common distance metrics belong to the Lp norm family:
L_p(x, y) = (Σᵢ₌₁ᵈ |xᵢ - yᵢ|ᵖ)^(1/p)
Special Cases: - L₁ (Manhattan): d(x,y) = Σᵢ|xᵢ - yᵢ| (robust to outliers)
- L₂ (Euclidean): d(x,y) = √(Σᵢ(xᵢ - yᵢ)²) (most common)
- L∞ (Chebyshev): d(x,y) = maxᵢ|xᵢ - yᵢ| (worst-case distance)
Mahalanobis Distance: Covariance-Aware Metric
Standard metrics assume feature independence and equal importance. Mahalanobis distance accounts for covariance:
d_M(x, y) = √((x - y)ᵀ Σ⁻¹ (x - y))
Where Σ is the covariance matrix. This metric: - Normalizes by variance (automatic scaling) - Accounts for correlation between features - Reduces to Euclidean when features are independent and unit variance
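SciPy provides this metric directly; here is a minimal sketch comparing Euclidean and Mahalanobis distances under a made-up covariance matrix:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

# Hypothetical covariance of the feature distribution
cov = np.array([[3.0, 2.0],
                [2.0, 2.0]])
VI = np.linalg.inv(cov)  # mahalanobis() expects the inverse covariance

x = np.array([1.0, 1.0])
y = np.array([2.0, 2.0])

# The Mahalanobis distance shrinks along the direction of high (co)variance
print(f"Euclidean distance:   {euclidean(x, y):.3f}")
print(f"Mahalanobis distance: {mahalanobis(x, y, VI):.3f}")
```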
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances, cosine_distances
# Sample points for distance calculation
point1 = np.array([1, 2])
point2 = np.array([4, 6])
point3 = np.array([2, 1])
points = np.array([point1, point2, point3])
print("Distance Calculations:")
print("Points:", points)
print()
# Euclidean Distance
euclidean_dist = euclidean_distances(points)
print("Euclidean Distance Matrix:")
print(euclidean_dist.round(3))
# Manhattan Distance
manhattan_dist = manhattan_distances(points)
print("\nManhattan Distance Matrix:")
print(manhattan_dist.round(3))
# Cosine Distance
cosine_dist = cosine_distances(points)
print("\nCosine Distance Matrix:")
print(cosine_dist.round(3))
# Manual calculation for understanding
def euclidean_distance(p1, p2):
return np.sqrt(np.sum((p1 - p2) ** 2))
def manhattan_distance(p1, p2):
return np.sum(np.abs(p1 - p2))
def cosine_distance(p1, p2):
dot_product = np.dot(p1, p2)
norms = np.linalg.norm(p1) * np.linalg.norm(p2)
return 1 - (dot_product / norms)
print(f"\nManual Euclidean distance between point1 and point2: {euclidean_distance(point1, point2):.3f}")
print(f"Manual Manhattan distance between point1 and point2: {manhattan_distance(point1, point2):.3f}")
print(f"Manual Cosine distance between point1 and point2: {cosine_distance(point1, point2):.3f}")# Visualize different distance metrics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
distance_metrics = ['euclidean', 'manhattan', 'cosine']
for idx, metric in enumerate(distance_metrics):
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_demo, y_demo)
# Create decision boundary
h = 0.02
x_min, x_max = X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1
y_min, y_max = X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
axes[idx].scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, cmap='viridis')
axes[idx].set_title(f'KNN with {metric.capitalize()} Distance')
axes[idx].set_xlabel('Feature 1')
axes[idx].set_ylabel('Feature 2')
plt.tight_layout()
plt.show()

The choice of K is crucial for KNN performance:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# Use Iris dataset for K optimization
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)
# Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_iris)
X_test_scaled = scaler.transform(X_test_iris)
# Test different K values
k_values = range(1, 31)
cv_scores = []
train_scores = []
test_scores = []
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
# Cross-validation score
cv_score = cross_val_score(knn, X_train_scaled, y_train_iris, cv=5).mean()
cv_scores.append(cv_score)
# Fit model for train/test scores
knn.fit(X_train_scaled, y_train_iris)
train_score = knn.score(X_train_scaled, y_train_iris)
test_score = knn.score(X_test_scaled, y_test_iris)
train_scores.append(train_score)
test_scores.append(test_score)
# Plot results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_values, cv_scores, 'o-', label='Cross-Validation Score')
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Cross-Validation Score vs K')
plt.grid(True)
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(k_values, train_scores, 'o-', label='Training Score', alpha=0.7)
plt.plot(k_values, test_scores, 'o-', label='Testing Score', alpha=0.7)
plt.xlabel('K Value')
plt.ylabel('Accuracy')
plt.title('Training vs Testing Score')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# Find optimal K
optimal_k = k_values[np.argmax(cv_scores)]
print(f"Optimal K value: {optimal_k}")
print(f"Best cross-validation score: {max(cv_scores):.4f}")
# Compare odd vs even K values
odd_k_scores = [cv_scores[i] for i in range(len(k_values)) if k_values[i] % 2 == 1]
even_k_scores = [cv_scores[i] for i in range(len(k_values)) if k_values[i] % 2 == 0]
print(f"Average score for odd K: {np.mean(odd_k_scores):.4f}")
print(f"Average score for even K: {np.mean(even_k_scores):.4f}")Let’s implement KNN from scratch to understand the algorithm better:
class KNNFromScratch:
def __init__(self, k=3, distance_metric='euclidean'):
self.k = k
self.distance_metric = distance_metric
def fit(self, X, y):
"""Store training data"""
self.X_train = X
self.y_train = y
def _calculate_distance(self, x1, x2):
"""Calculate distance between two points"""
if self.distance_metric == 'euclidean':
return np.sqrt(np.sum((x1 - x2) ** 2))
elif self.distance_metric == 'manhattan':
return np.sum(np.abs(x1 - x2))
else:
raise ValueError("Unsupported distance metric")
def predict(self, X):
"""Make predictions for test data"""
predictions = []
for test_point in X:
# Calculate distances to all training points
distances = []
for train_point in self.X_train:
dist = self._calculate_distance(test_point, train_point)
distances.append(dist)
# Get indices of k nearest neighbors
k_indices = np.argsort(distances)[:self.k]
# Get labels of k nearest neighbors
k_nearest_labels = [self.y_train[i] for i in k_indices]
# Majority vote
prediction = max(set(k_nearest_labels), key=k_nearest_labels.count)
predictions.append(prediction)
return np.array(predictions)
def score(self, X, y):
"""Calculate accuracy"""
predictions = self.predict(X)
return np.mean(predictions == y)
# Test our implementation
knn_scratch = KNNFromScratch(k=3, distance_metric='euclidean')
knn_scratch.fit(X_train_scaled, y_train_iris)
# Compare with sklearn
knn_sklearn = KNeighborsClassifier(n_neighbors=3)
knn_sklearn.fit(X_train_scaled, y_train_iris)
scratch_accuracy = knn_scratch.score(X_test_scaled, y_test_iris)
sklearn_accuracy = knn_sklearn.score(X_test_scaled, y_test_iris)
print("KNN Implementation Comparison:")
print(f"From Scratch Accuracy: {scratch_accuracy:.4f}")
print(f"Scikit-learn Accuracy: {sklearn_accuracy:.4f}")
print(f"Difference: {abs(scratch_accuracy - sklearn_accuracy):.6f}")Support Vector Machines represent one of the most theoretically principled approaches to classification, grounded in statistical learning theory and convex optimization. SVMs find the optimal separating hyperplane that maximizes the margin between classes.
Geometric Intuition: Maximum Margin Principle
Given linearly separable data, infinitely many hyperplanes can separate the classes. SVM chooses the hyperplane that maximizes the margin - the distance to the nearest training examples.
For a hyperplane defined by wᵀx + b = 0, the margin is 2/||w||.
Primal Optimization Problem
SVM solves the constrained optimization problem:
minimize_{w,b} (1/2)||w||²
subject to: yᵢ(wᵀxᵢ + b) ≥ 1, ∀i ∈ {1, …, n}
This ensures all points are correctly classified with margin at least 1/||w||.
Soft-Margin SVM: Handling Non-Separable Data
For non-linearly separable data, we introduce slack variables ξᵢ:
minimize_{w,b,ξ} (1/2)||w||² + C Σᵢ ξᵢ
subject to: yᵢ(wᵀxᵢ + b) ≥ 1 - ξᵢ and ξᵢ ≥ 0, ∀i
The parameter C controls the bias-variance trade-off:
- Large C: Low bias, high variance (approaches a hard margin)
- Small C: High bias, low variance (softer margin)
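A small experiment makes this trade-off tangible. The sketch below (synthetic data; parameter values are illustrative) fits a linear SVC at several C values and reports the per-class support-vector counts exposed by scikit-learn's n_support_ attribute; softer margins (small C) leave more points inside the margin and therefore produce more support vectors:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_c, y_c = make_classification(n_samples=200, n_features=2, n_redundant=0,
                               n_informative=2, random_state=42)
for C in [0.01, 1.0, 100.0]:
    svc = SVC(kernel='linear', C=C).fit(X_c, y_c)
    # n_support_ holds the number of support vectors per class
    print(f"C={C:>6}: support vectors per class = {svc.n_support_}, "
          f"train accuracy = {svc.score(X_c, y_c):.3f}")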
Dual Formulation and Support Vectors
Using Lagrange multipliers, the dual problem becomes:
maximize_α Σᵢ αᵢ - (1/2) Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ ⟨xᵢ, xⱼ⟩
subject to: Σᵢ αᵢyᵢ = 0 and 0 ≤ αᵢ ≤ C, ∀i
Points with αᵢ > 0 are support vectors - they define the decision boundary.
The kernel trick enables SVMs to efficiently work in high-dimensional (even infinite-dimensional) feature spaces without explicitly computing the transformations.
Mathematical Foundation of Kernels
A kernel function K(x, z) implicitly defines a mapping φ: X → H to a Hilbert space H:
**K(x, z) = ⟨φ(x), φ(z)⟩_H**
Mercer’s Theorem provides conditions for valid kernels: K must be positive semi-definite.
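The sketch below builds an RBF kernel matrix explicitly, checks it against scikit-learn's rbf_kernel, and numerically verifies positive semi-definiteness (the Mercer condition); the sample points and gamma value are illustrative:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X_k = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
gamma = 0.5
# K(x, z) = exp(-gamma * ||x - z||²), computed for every pair of rows
diffs = X_k[:, None, :] - X_k[None, :, :]
K_manual = np.exp(-gamma * np.sum(diffs ** 2, axis=-1))
print(np.allclose(K_manual, rbf_kernel(X_k, X_k, gamma=gamma)))  # True
# A valid kernel matrix is positive semi-definite: all eigenvalues >= 0
print(np.all(np.linalg.eigvalsh(K_manual) >= -1e-10))            # True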
Common Kernel Functions and Their Properties

| Kernel | Formula | Typical Use |
|---|---|---|
| Linear | K(x, z) = xᵀz | High-dimensional or linearly separable data |
| Polynomial | K(x, z) = (γ xᵀz + r)^d | Feature interactions up to degree d |
| RBF (Gaussian) | K(x, z) = exp(-γ ||x - z||²) | General-purpose non-linear default |
| Sigmoid | K(x, z) = tanh(γ xᵀz + r) | Neural-network-like decision surfaces |

Kernel Selection Principles
- Start with the RBF kernel as a robust default for non-linear problems.
- Prefer the linear kernel when features greatly outnumber samples.
- Tune γ (and degree d for the polynomial kernel) jointly with C via cross-validation.
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Create a pipeline with feature scaling and SVM
svm_pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42))
])
# Train the SVM
svm_pipeline.fit(X_train_iris, y_train_iris)
# Make predictions
y_pred_svm = svm_pipeline.predict(X_test_iris)
# Evaluate performance
svm_accuracy = accuracy_score(y_test_iris, y_pred_svm)
print(f"SVM Accuracy: {svm_accuracy:.4f}")
# Detailed classification report
print("\nSVM Classification Report:")
print(classification_report(y_test_iris, y_pred_svm, target_names=iris.target_names))
# Confusion Matrix
plt.figure(figsize=(8, 6))
cm_svm = confusion_matrix(y_test_iris, y_pred_svm)
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.title('Confusion Matrix - SVM Iris Classification')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Compare different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
svm_accuracies = {}
for kernel in kernels:
svm_model = SVC(kernel=kernel, C=1.0, gamma='scale', random_state=42)
svm_model.fit(X_train_iris, y_train_iris)
y_pred = svm_model.predict(X_test_iris)
accuracy = accuracy_score(y_test_iris, y_pred)
svm_accuracies[kernel] = accuracy
print(f"{kernel.capitalize()} Kernel SVM Accuracy: {accuracy:.4f}")
# Visualize kernel performance
plt.figure(figsize=(10, 6))
plt.bar(svm_accuracies.keys(), svm_accuracies.values())
plt.title('SVM Accuracy with Different Kernels')
plt.xlabel('Kernel Type')
plt.ylabel('Accuracy')
plt.ylim(0.8, 1.0)
plt.grid(axis='y')
plt.show()

Logistic regression represents a fundamental probabilistic approach to classification, modeling the posterior class probabilities using the logistic function. It’s grounded in maximum likelihood estimation and provides well-calibrated probability estimates.
Probabilistic Foundation: Generalized Linear Models
Logistic regression belongs to the family of Generalized Linear Models (GLMs), where we model the relationship between features and target through a link function.
For binary classification, we model the log-odds (logit):
log(P(y=1|x) / P(y=0|x)) = wᵀx + b
Solving for P(y=1|x) gives the logistic function:
P(y=1|x) = 1 / (1 + e^(-(wᵀx + b))) = σ(wᵀx + b)
Statistical Interpretation
The linear combination wᵀx + b represents the log-odds ratio:
- When wᵀx + b = 0, P(y=1|x) = 0.5 (equal probability)
- When wᵀx + b > 0, P(y=1|x) > 0.5 (favors class 1)
- When wᵀx + b < 0, P(y=1|x) < 0.5 (favors class 0)
Decision Boundary
The decision boundary is the hyperplane where P(y=1|x) = 0.5:
wᵀx + b = 0
This is linear in the original feature space, making logistic regression a linear classifier.
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
"""Sigmoid activation function"""
return 1 / (1 + np.exp(-np.clip(z, -250, 250))) # Clip to prevent overflow
# Plot sigmoid function
z = np.linspace(-10, 10, 100)
y = sigmoid(z)
plt.figure(figsize=(10, 6))
plt.plot(z, y, 'b-', linewidth=2, label='Sigmoid Function')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7, label='Decision Threshold')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z = w^T x + b')
plt.ylabel('P(y=1|x)')
plt.title('Sigmoid Function for Logistic Regression')
plt.grid(True, alpha=0.3)
plt.legend()
plt.ylim(0, 1)
plt.show()
print("Sigmoid Function Properties:")
print(f"sigmoid(0) = {sigmoid(0):.3f}")
print(f"sigmoid(-"��) "�� {sigmoid(-10):.6f}")
print(f"sigmoid(+"��) "�� {sigmoid(10):.6f}")Logistic regression parameters are estimated using Maximum Likelihood Estimation (MLE), which provides strong statistical foundations for the learning algorithm.
Likelihood Function
Assuming independence, the likelihood of observing the data is:
L(w) = Πᵢ₌₁ⁿ P(yᵢ|xᵢ)^yᵢ × (1 - P(yᵢ|xᵢ))^(1-yᵢ)
Log-Likelihood (Easier to Optimize)
Taking the logarithm (monotonic transformation):
ℓ(w) = Σᵢ₌₁ⁿ [yᵢ log P(yᵢ|xᵢ) + (1-yᵢ) log(1 - P(yᵢ|xᵢ))]
Cost Function (Negative Log-Likelihood)
To convert to a minimization problem:
J(w) = -(1/n) ℓ(w) = -(1/n) Σᵢ₌₁ⁿ [yᵢ log σ(wᵀxᵢ) + (1-yᵢ) log(1 - σ(wᵀxᵢ))]
Convexity and Global Optimization
The logistic regression cost function is convex (strictly convex when the design matrix has full column rank), guaranteeing:
1. Unique global minimum: No local minima to get trapped in
2. Gradient descent convergence: Iterative optimization reaches the optimal solution
3. Well-defined MLE: Under mild regularity conditions
class LogisticRegressionFromScratch:
def __init__(self, learning_rate=0.01, max_iterations=1000):
self.learning_rate = learning_rate
self.max_iterations = max_iterations
def fit(self, X, y):
"""Train the logistic regression model"""
# Add bias term
m, n = X.shape
self.weights = np.zeros(n + 1)
X_with_bias = np.column_stack([np.ones(m), X])
# Store cost history
self.cost_history = []
for i in range(self.max_iterations):
# Forward pass
z = np.dot(X_with_bias, self.weights)
predictions = sigmoid(z)
# Compute cost
cost = self._compute_cost(y, predictions)
self.cost_history.append(cost)
# Backward pass (gradient descent)
gradient = np.dot(X_with_bias.T, (predictions - y)) / m
self.weights -= self.learning_rate * gradient
def _compute_cost(self, y_true, y_pred):
"""Compute logistic regression cost"""
# Clip predictions to prevent log(0)
y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
cost = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
return cost
def predict_proba(self, X):
"""Predict class probabilities"""
m = X.shape[0]
X_with_bias = np.column_stack([np.ones(m), X])
return sigmoid(np.dot(X_with_bias, self.weights))
def predict(self, X):
"""Make predictions"""
return (self.predict_proba(X) >= 0.5).astype(int)
def score(self, X, y):
"""Calculate accuracy"""
predictions = self.predict(X)
return np.mean(predictions == y)
# Test implementation with binary classification
from sklearn.datasets import make_classification
X_binary, y_binary = make_classification(n_samples=1000, n_features=2,
n_redundant=0, n_informative=2,
n_clusters_per_class=1, random_state=42)
# Split data
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(
X_binary, y_binary, test_size=0.3, random_state=42
)
# Train custom implementation
lr_scratch = LogisticRegressionFromScratch(learning_rate=0.1, max_iterations=1000)
lr_scratch.fit(X_train_lr, y_train_lr)
# Train sklearn implementation
from sklearn.linear_model import LogisticRegression
lr_sklearn = LogisticRegression(random_state=42)
lr_sklearn.fit(X_train_lr, y_train_lr)
# Compare results
scratch_acc = lr_scratch.score(X_test_lr, y_test_lr)
sklearn_acc = lr_sklearn.score(X_test_lr, y_test_lr)
print("Logistic Regression Comparison:")
print(f"From Scratch Accuracy: {scratch_acc:.4f}")
print(f"Scikit-learn Accuracy: {sklearn_acc:.4f}")
print(f"Difference: {abs(scratch_acc - sklearn_acc):.6f}")
# Plot cost function convergence
plt.figure(figsize=(10, 6))
plt.plot(lr_scratch.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Logistic Regression Cost Function Convergence')
plt.grid(True, alpha=0.3)
plt.show()

For multi-class problems, logistic regression uses strategies like One-vs-Rest or softmax regression.
For K classes, the softmax function is:
$$P(y=k|x) = \frac{e^{w_k^T x + b_k}}{\sum_{j=1}^{K} e^{w_j^T x + b_j}}$$
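A minimal, numerically stable softmax follows directly from this formula; the scores below are hypothetical per-class logits w_kᵀx + b_k for a single sample:

import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # guard against overflow
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # K = 3 classes
probs = softmax(scores)
print(probs.round(4))   # approx. [0.659 0.2424 0.0986]
print(probs.sum())      # 1.0, a valid probability distribution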
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Multi-class comparison on Iris dataset
multiclass_strategies = {
'Multinomial': LogisticRegression(multi_class='multinomial', random_state=42),
'One-vs-Rest': LogisticRegression(multi_class='ovr', random_state=42),
'OneVsRestClassifier': OneVsRestClassifier(LogisticRegression(random_state=42))
}
multiclass_results = {}
for name, classifier in multiclass_strategies.items():
classifier.fit(X_train_scaled, y_train_iris)
train_acc = classifier.score(X_train_scaled, y_train_iris)
test_acc = classifier.score(X_test_scaled, y_test_iris)
multiclass_results[name] = {
'Train Accuracy': train_acc,
'Test Accuracy': test_acc
}
multiclass_df = pd.DataFrame(multiclass_results).T
print("Multi-class Logistic Regression Comparison:")
print(multiclass_df.round(4))

Regularization prevents overfitting by adding penalty terms to the cost function.
L1 (Lasso): $J(w) = J_0(w) + \lambda \sum_{i=1}^{n} |w_i|$
L2 (Ridge): $J(w) = J_0(w) + \lambda \sum_{i=1}^{n} w_i^2$
from sklearn.model_selection import validation_curve
# Compare different regularization strengths
C_values = np.logspace(-3, 3, 7) # C is inverse of regularization strength
# L1 Regularization
train_scores_l1, val_scores_l1 = validation_curve(
LogisticRegression(penalty='l1', solver='liblinear'),
X_train_scaled, y_train_iris,
param_name='C', param_range=C_values, cv=5
)
# L2 Regularization
train_scores_l2, val_scores_l2 = validation_curve(
LogisticRegression(penalty='l2'),
X_train_scaled, y_train_iris,
param_name='C', param_range=C_values, cv=5
)
# Plot validation curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# L1 Regularization
train_mean_l1 = np.mean(train_scores_l1, axis=1)
val_mean_l1 = np.mean(val_scores_l1, axis=1)
ax1.semilogx(C_values, train_mean_l1, 'o-', color='blue', label='Training')
ax1.semilogx(C_values, val_mean_l1, 'o-', color='red', label='Validation')
ax1.set_xlabel('C Parameter')
ax1.set_ylabel('Accuracy')
ax1.set_title('L1 Regularization (Lasso)')
ax1.legend()
ax1.grid(True)
# L2 Regularization
train_mean_l2 = np.mean(train_scores_l2, axis=1)
val_mean_l2 = np.mean(val_scores_l2, axis=1)
ax2.semilogx(C_values, train_mean_l2, 'o-', color='blue', label='Training')
ax2.semilogx(C_values, val_mean_l2, 'o-', color='red', label='Validation')
ax2.set_xlabel('C Parameter')
ax2.set_ylabel('Accuracy')
ax2.set_title('L2 Regularization (Ridge)')
ax2.legend()
ax2.grid(True)
plt.tight_layout()
plt.show()
# Find optimal C values
optimal_C_l1 = C_values[np.argmax(val_mean_l1)]
optimal_C_l2 = C_values[np.argmax(val_mean_l2)]
print(f"Optimal C for L1: {optimal_C_l1:.3f}")
print(f"Optimal C for L2: {optimal_C_l2:.3f}")Model evaluation in classification requires understanding the statistical properties of performance metrics and their interpretations. Each metric captures different aspects of model behavior, and the choice depends on the problem context and class distribution.
The Fundamental Evaluation Framework
Classification evaluation is based on the confusion matrix, which cross-tabulates predicted versus actual labels. For binary classification:
|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | TN          | FP          |
| Actual 1 | FN          | TP          |
Where:
- TP (True Positives): Correctly predicted positive cases
- TN (True Negatives): Correctly predicted negative cases
- FP (False Positives): Incorrectly predicted as positive (Type I error)
- FN (False Negatives): Incorrectly predicted as negative (Type II error)
Statistical Interpretation of Errors
The confusion matrix forms the mathematical foundation for all classification metrics.
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import seaborn as sns
# Train multiple models for comparison
models = {
'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
'Logistic Regression': LogisticRegression(random_state=42)
}
# Train all models and collect predictions
model_predictions = {}
model_probabilities = {}
for name, model in models.items():
model.fit(X_train_scaled, y_train_iris)
predictions = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)
model_predictions[name] = predictions
model_probabilities[name] = probabilities
# Create confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()
for idx, (name, predictions) in enumerate(model_predictions.items()):
cm = confusion_matrix(y_test_iris, predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=iris.target_names,
yticklabels=iris.target_names,
ax=axes[idx])
axes[idx].set_title(f'{name} - Confusion Matrix')
axes[idx].set_ylabel('Actual')
axes[idx].set_xlabel('Predicted')
plt.tight_layout()
plt.show()

Accuracy: Overall Correctness
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy measures the proportion of correct predictions. However, it can be misleading with imbalanced classes due to the accuracy paradox.
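The sketch below illustrates the accuracy paradox on a hypothetical 95:5 imbalanced problem: a classifier that always predicts the majority class reaches 95% accuracy while never detecting a single positive case.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)
y_pred_majority = np.zeros(100, dtype=int)  # always predict the majority class
print(f"Accuracy: {accuracy_score(y_true, y_pred_majority):.2f}")               # 0.95
print(f"Recall (minority class): {recall_score(y_true, y_pred_majority):.2f}")  # 0.00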
Precision: Positive Predictive Value
Precision = TP / (TP + FP)
Precision answers: “Of all positive predictions, how many were actually correct?” High precision minimizes false alarms.
Recall (Sensitivity): True Positive Rate
Recall = TP / (TP + FN)
Recall answers: “Of all actual positives, how many did we correctly identify?” High recall minimizes missed detections.
F1-Score: Harmonic Mean of Precision and Recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1-score provides a balanced measure when precision and recall are both important. The harmonic mean penalizes extreme values more heavily than the arithmetic mean does, so a model cannot achieve a high F1 by excelling at only one of the two.
Specificity: True Negative Rate
Specificity = TN / (TN + FP)
Specificity measures the ability to correctly identify negative cases.
Statistical Trade-offs
There exists a fundamental precision-recall trade-off: improving one often decreases the other. The optimal balance depends on the relative costs of Type I and Type II errors in your application domain.
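The trade-off becomes visible by sweeping the decision threshold of a probabilistic classifier, as in the sketch below (synthetic data, evaluated on the training set purely for illustration): raising the threshold typically increases precision and lowers recall.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X_t, y_t = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
probs = LogisticRegression(max_iter=1000).fit(X_t, y_t).predict_proba(X_t)[:, 1]
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: precision={precision_score(y_t, preds):.3f}, "
          f"recall={recall_score(y_t, preds):.3f}")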
# Calculate comprehensive metrics for all models
evaluation_results = {}
for name, predictions in model_predictions.items():
metrics = {
'Accuracy': accuracy_score(y_test_iris, predictions),
'Precision (macro)': precision_score(y_test_iris, predictions, average='macro'),
'Recall (macro)': recall_score(y_test_iris, predictions, average='macro'),
'F1-Score (macro)': f1_score(y_test_iris, predictions, average='macro')
}
evaluation_results[name] = metrics
# Create comparison DataFrame
eval_df = pd.DataFrame(evaluation_results).T
print("Classification Metrics Comparison:")
print(eval_df.round(4))
# Visualize metrics comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
metrics = ['Accuracy', 'Precision (macro)', 'Recall (macro)', 'F1-Score (macro)']
for idx, metric in enumerate(metrics):
ax = axes[idx//2, idx%2]
values = eval_df[metric]
bars = ax.bar(range(len(values)), values, alpha=0.7)
ax.set_title(f'{metric} Comparison')
ax.set_ylabel(metric)
ax.set_xticks(range(len(values)))
ax.set_xticklabels(eval_df.index, rotation=45)
ax.set_ylim(0, 1.1)
# Add value labels on bars
for i, bar in enumerate(bars):
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
f'{height:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()

# Detailed classification reports
print("Detailed Classification Reports:")
print("=" * 60)
for name, predictions in model_predictions.items():
print(f"\n{name}:")
print("-" * 40)
report = classification_report(y_test_iris, predictions,
target_names=iris.target_names)
print(report)

For binary and multi-class ROC analysis:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from itertools import cycle
# Binarize the output for multi-class ROC
y_test_binarized = label_binarize(y_test_iris, classes=[0, 1, 2])
n_classes = y_test_binarized.shape[1]
# Plot ROC curves for each model
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for idx, (name, probabilities) in enumerate(model_probabilities.items()):
ax = axes[idx]
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], probabilities[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Plot ROC curves for each class
for i, color in zip(range(n_classes), colors):
ax.plot(fpr[i], tpr[i], color=color, lw=2,
label=f'ROC curve of class {iris.target_names[i]} (area = {roc_auc[i]:.2f})')
ax.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'{name} - ROC Curves')
ax.legend(loc="lower right", fontsize=8)
plt.tight_layout()
plt.show()
# Calculate average AUC for each model
avg_auc_scores = {}
for name, probabilities in model_probabilities.items():
auc_scores = []
for i in range(n_classes):
fpr, tpr, _ = roc_curve(y_test_binarized[:, i], probabilities[:, i])
auc_scores.append(auc(fpr, tpr))
avg_auc_scores[name] = np.mean(auc_scores)
print("Average AUC Scores:")
for name, score in sorted(avg_auc_scores.items(), key=lambda x: x[1], reverse=True):
print(f"{name:<20}: {score:.4f}")from sklearn.model_selection import cross_validate
from scipy import stats
# Perform comprehensive cross-validation
cv_results = {}
for name, model in models.items():
cv_result = cross_validate(model, X_iris, y_iris, cv=10,
scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
return_train_score=True)
cv_results[name] = cv_result
# Extract and display results
cv_summary = {}
for name, results in cv_results.items():
cv_summary[name] = {
'CV Accuracy': results['test_accuracy'].mean(),
'CV Accuracy Std': results['test_accuracy'].std(),
'CV Precision': results['test_precision_macro'].mean(),
'CV Recall': results['test_recall_macro'].mean(),
'CV F1-Score': results['test_f1_macro'].mean()
}
cv_df = pd.DataFrame(cv_summary).T
print("Cross-Validation Results (10-fold):")
print(cv_df.round(4))
# Statistical significance test (paired t-test)
print("\nPairwise Model Comparison (Accuracy):")
print("=" * 50)
model_names = list(cv_results.keys())
for i in range(len(model_names)):
for j in range(i+1, len(model_names)):
model1, model2 = model_names[i], model_names[j]
scores1 = cv_results[model1]['test_accuracy']
scores2 = cv_results[model2]['test_accuracy']
t_stat, p_value = stats.ttest_rel(scores1, scores2)
print(f"{model1} vs {model2}:")
print(f" Mean diff: {np.mean(scores1 - scores2):.4f}")
print(f" p-value: {p_value:.4f}")
print(f" Significant: {'Yes' if p_value < 0.05 else 'No'}")
print()

from sklearn.model_selection import learning_curve
# Generate learning curves for best performing model
best_model_name = max(avg_auc_scores, key=avg_auc_scores.get)
best_model = models[best_model_name]
print(f"Generating learning curve for best model: {best_model_name}")
train_sizes, train_scores, val_scores = learning_curve(
best_model, X_iris, y_iris, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy', n_jobs=-1
)
# Calculate means and standard deviations
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
color='blue', alpha=0.1)
plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
color='red', alpha=0.1)
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy Score')
plt.title(f'Learning Curve - {best_model_name}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Check for overfitting/underfitting
final_gap = train_mean[-1] - val_mean[-1]
print(f"Final training-validation gap: {final_gap:.4f}")
if final_gap > 0.05:
print("⚠️ Possible overfitting detected")
elif val_mean[-1] < 0.8:
print("⚠️ Possible underfitting detected")
else:
print("✅ Model appears well-fitted")# Final comprehensive comparison of all algorithms
final_comparison = {}
algorithms = {
'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'KNN': KNeighborsClassifier(n_neighbors=5),
'SVM': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
'Logistic Regression': LogisticRegression(random_state=42)
}
# Evaluate on multiple datasets
from sklearn.datasets import load_breast_cancer, load_wine
datasets = {
'Iris': (X_iris, y_iris),
'Breast Cancer': load_breast_cancer(return_X_y=True),
'Wine': load_wine(return_X_y=True)
}
comparison_results = {}
for dataset_name, (X, y) in datasets.items():
print(f"Evaluating on {dataset_name} dataset...")
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
dataset_results = {}
for alg_name, algorithm in algorithms.items():
# Cross-validation
cv_scores = cross_val_score(algorithm, X_scaled, y, cv=5, scoring='accuracy')
dataset_results[alg_name] = {
'Mean CV Accuracy': cv_scores.mean(),
'Std CV Accuracy': cv_scores.std()
}
comparison_results[dataset_name] = dataset_results
# Display results
print("\nAlgorithm Performance Across Datasets:")
print("=" * 60)
for dataset_name, results in comparison_results.items():
print(f"\n{dataset_name} Dataset:")
df_temp = pd.DataFrame(results).T
print(df_temp.round(4))
# Find best algorithm for this dataset
best_alg = df_temp['Mean CV Accuracy'].idxmax()
best_score = df_temp.loc[best_alg, 'Mean CV Accuracy']
print(f"Best Algorithm: {best_alg} ({best_score:.4f})")
# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for idx, (dataset_name, results) in enumerate(comparison_results.items()):
df_plot = pd.DataFrame(results).T
bars = axes[idx].bar(range(len(df_plot)), df_plot['Mean CV Accuracy'])
axes[idx].errorbar(range(len(df_plot)), df_plot['Mean CV Accuracy'],
yerr=df_plot['Std CV Accuracy'], fmt='none',
color='black', capsize=5)
axes[idx].set_title(f'{dataset_name} Dataset')
axes[idx].set_ylabel('Cross-Validation Accuracy')
axes[idx].set_xticks(range(len(df_plot)))
axes[idx].set_xticklabels(df_plot.index, rotation=45)
axes[idx].set_ylim(0, 1.1)
# Highlight best performer
best_idx = df_plot['Mean CV Accuracy'].argmax()
bars[best_idx].set_color('gold')
plt.tight_layout()
plt.show()

| Algorithm | Best For | Pros | Cons |
|---|---|---|---|
| Decision Tree | Interpretable models, mixed data types | Easy to understand, handles missing values | Prone to overfitting, unstable |
| Random Forest | General-purpose, feature importance | Reduces overfitting, robust | Less interpretable, memory intensive |
| KNN | Small datasets, local patterns | Simple, no assumptions | Computationally expensive, sensitive to scaling |
| SVM | High-dimensional data, non-linear patterns | Effective in high dimensions, memory efficient | Slow on large datasets, requires scaling |
| Logistic Regression | Linear relationships, probability estimates | Fast, interpretable, probabilistic | Assumes linear relationships |
# Decision tree for algorithm selection
def recommend_algorithm(dataset_size, interpretability_needed, data_linearity, computational_budget):
"""
Simple algorithm recommendation system
"""
recommendations = []
if interpretability_needed == "high":
if dataset_size == "small":
recommendations.append("Decision Tree")
else:
recommendations.append("Logistic Regression")
if data_linearity == "linear":
recommendations.extend(["Logistic Regression", "SVM (linear)"])
else:
recommendations.extend(["Random Forest", "SVM (RBF)", "KNN"])
if computational_budget == "low":
recommendations = [alg for alg in recommendations if alg not in ["KNN", "SVM (RBF)"]]
recommendations.append("Logistic Regression")
if dataset_size == "large":
recommendations = [alg for alg in recommendations if alg != "KNN"]
return list(set(recommendations))
# Example recommendations
print("Algorithm Recommendation Examples:")
print("=" * 40)
scenarios = [
{"size": "small", "interpretability": "high", "linearity": "linear", "budget": "high"},
{"size": "large", "interpretability": "low", "linearity": "non-linear", "budget": "medium"},
{"size": "medium", "interpretability": "medium", "linearity": "unknown", "budget": "low"}
]
for i, scenario in enumerate(scenarios, 1):
recommendations = recommend_algorithm(
scenario["size"], scenario["interpretability"],
scenario["linearity"], scenario["budget"]
)
print(f"\nScenario {i}: {scenario}")
print(f"Recommended: {', '.join(recommendations)}")def classification_preprocessing_checklist():
checklist = [
""'� Handle missing values appropriately",
""'� Encode categorical variables (one-hot, label encoding)",
""'� Scale/normalize features (especially for KNN, SVM, LogReg)",
""'� Check for class imbalance",
""'� Remove or handle outliers",
""'� Feature selection/engineering",
""'� Split data properly (train/validation/test)",
""'� Ensure no data leakage"
]
print("Classification Data Preparation Checklist:")
print("=" * 45)
for item in checklist:
print(item)
classification_preprocessing_checklist()

def model_selection_workflow():
steps = [
"1. Start with simple baselines (Logistic Regression, Decision Tree)",
"2. Try ensemble methods (Random Forest)",
"3. Experiment with different algorithms (SVM, KNN)",
"4. Tune hyperparameters using cross-validation",
"5. Evaluate using multiple metrics",
"6. Check for overfitting/underfitting",
"7. Test final model on held-out test set",
"8. Consider business constraints (interpretability, speed)"
]
print("Model Selection Workflow:")
print("=" * 30)
for step in steps:
print(step)
model_selection_workflow()

1. Statistical Learning Foundation: Classification algorithms approximate the Bayes optimal classifier P(Y|X), each making different assumptions about the data generating process and decision boundaries.
2. Algorithm-Specific Theoretical Strengths:
- Decision Trees: Information-theoretic splitting using entropy/Gini; naturally handle feature interactions
- KNN: Non-parametric with universal consistency guarantees; assumes local smoothness
- SVM: Maximum margin principle with an optimal separating hyperplane; kernel trick for non-linearity
- Logistic Regression: Probabilistic GLM with convex optimization and well-calibrated probabilities
3. Mathematical Evaluation Framework: Performance metrics derive from the confusion matrix, each capturing different aspects of Type I/II errors with statistical interpretation.
4. Bias-Variance Considerations: Different algorithms exhibit different bias-variance trade-offs - understanding these helps with algorithm selection and hyperparameter tuning.
5. Computational Complexity: Costs vary dramatically across algorithms (roughly O(n log n) training for trees, essentially no training but O(n) work per prediction for KNN, up to O(n³) training for kernel SVMs), affecting scalability decisions.
6. No Free Lunch Theorem: No universally best algorithm exists - optimal choice depends on data distribution, noise level, sample size, and interpretability requirements.
Exercise 1. Using the Titanic dataset:
1. Build a decision tree to predict survival
2. Visualize the tree and interpret the rules
3. Compare different pruning strategies
4. Analyze feature importance

Exercise 2. With the Wine dataset:
1. Find the optimal K using cross-validation
2. Compare different distance metrics
3. Analyze the effect of feature scaling
4. Implement weighted KNN from scratch

Exercise 3. Using a synthetic non-linear dataset:
1. Create data that’s not linearly separable
2. Compare linear, polynomial, and RBF kernels
3. Tune hyperparameters using grid search
4. Visualize decision boundaries

Exercise 4. Choose a real-world classification problem:
1. Apply all algorithms covered in this chapter
2. Perform proper preprocessing and feature engineering
3. Use appropriate evaluation metrics
4. Create a detailed comparison report
5. Justify your final algorithm choice

Exercise 5. Using a highly imbalanced dataset:
1. Identify the class imbalance problem
2. Try different sampling techniques
3. Use appropriate evaluation metrics
4. Compare algorithm performance before and after handling imbalance
This completes Chapter 4: Classification Algorithms. The next chapter covers Regression Algorithms, exploring continuous prediction problems and their evaluation methods.

# Chapter 5: Regression Algorithms
“All models are wrong, but some are useful.”
- George E. P. Box
By the end of this chapter, you will be able to:
- Understand the mathematical foundations of regression algorithms
- Implement linear, polynomial, and regularized regression models
- Evaluate regression model performance using appropriate metrics
- Apply feature engineering techniques specific to regression problems
- Handle real-world regression challenges like overfitting and multicollinearity
- Build end-to-end regression pipelines for practical applications
Regression represents the cornerstone of statistical modeling, seeking to learn the conditional expectation E[Y|X] from observed data. The theoretical framework draws from probability theory, linear algebra, and optimization to provide both predictive power and inferential insights.
The Regression Learning Problem
Given training data D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where the xᵢ ∈ ℝᵈ are feature vectors and the yᵢ ∈ ℝ are continuous targets, we seek to learn a function:
f: ℝᵈ → ℝ
That minimizes the expected prediction error on future data.
Bias-Variance Decomposition in Regression
The expected prediction error can be decomposed into three fundamental components:
E[(Y - f̂(X))²] = Bias²[f̂(X)] + Var[f̂(X)] + σ²
Where:
- Bias²: Error from incorrect model assumptions
- Variance: Error from sensitivity to training data variations
- σ²: Irreducible noise in the data generating process
This decomposition guides algorithm selection and regularization strategies.
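The decomposition can be estimated empirically by refitting a model on many resampled training sets and measuring the spread of its predictions at a fixed point. In the simulation below (the true function sin(x), the noise level, sample size, and degrees are all illustrative choices), low-degree fits are dominated by bias and high-degree fits by variance:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_eval = np.array([[1.5]])          # fixed evaluation point
true_value = np.sin(1.5)

def bias_variance_at_point(degree, n_repeats=200):
    preds = []
    for _ in range(n_repeats):      # fresh training set each repetition
        X = rng.uniform(0, np.pi, (30, 1))
        y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_eval)[0])
    preds = np.array(preds)
    return (preds.mean() - true_value) ** 2, preds.var()

for degree in [1, 3, 9]:
    bias_sq, variance = bias_variance_at_point(degree)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")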
Statistical Assumptions
Classical regression theory relies on several key assumptions:
1. Linearity: The relationship between predictors and target is linear
2. Independence: Observations are independent
3. Homoscedasticity: Constant error variance across all prediction levels
4. Normality: Errors follow a normal distribution (required for inference)
Understanding when these assumptions hold, and how to address violations when they do not, is crucial for effective regression modeling.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression  # load_boston was removed from scikit-learn; California housing is used later
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
# Set style for consistent plotting
plt.style.use('default')
sns.set_palette("husl")
print("Common Regression Applications:")
print("=" * 50)
applications = {
"Real Estate": "Predicting house prices based on location, size, amenities",
"Finance": "Stock price forecasting, risk assessment, portfolio optimization",
"Marketing": "Sales prediction, customer lifetime value estimation",
"Healthcare": "Drug dosage optimization, treatment outcome prediction",
"Engineering": "Quality control, performance optimization, failure prediction",
"Economics": "GDP forecasting, inflation modeling, market analysis"
}
for domain, description in applications.items():
print(f"📊 {domain:<12}: {description}")
# Generate sample regression data for visualization
X_sample, y_sample = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
plt.figure(figsize=(12, 4))
# Simple regression example
plt.subplot(1, 3, 1)
plt.scatter(X_sample, y_sample, alpha=0.6)
reg = LinearRegression().fit(X_sample, y_sample)
plt.plot(X_sample, reg.predict(X_sample), color='red', linewidth=2)
plt.title('Linear Regression')
plt.xlabel('Feature')
plt.ylabel('Target')
# Polynomial regression example
plt.subplot(1, 3, 2)
X_poly = np.linspace(-3, 3, 100).reshape(-1, 1)
y_poly = 2 * X_poly.ravel()**2 + np.random.normal(0, 3, 100)
plt.scatter(X_poly, y_poly, alpha=0.6)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
poly_reg = Pipeline([('poly', PolynomialFeatures(2)), ('linear', LinearRegression())])
poly_reg.fit(X_poly, y_poly)
plt.plot(X_poly, poly_reg.predict(X_poly), color='red', linewidth=2)
plt.title('Polynomial Regression')
plt.xlabel('Feature')
plt.ylabel('Target')
# Multiple regression visualization
plt.subplot(1, 3, 3)
X_multi, y_multi = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)
reg_multi = LinearRegression().fit(X_multi, y_multi)
predicted = reg_multi.predict(X_multi)
plt.scatter(y_multi, predicted, alpha=0.6)
plt.plot([y_multi.min(), y_multi.max()], [y_multi.min(), y_multi.max()], 'r--', linewidth=2)
plt.title('Multiple Regression\n(Actual vs Predicted)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.tight_layout()
plt.show()

# Comparison of regression and classification
comparison_data = {
'Aspect': ['Target Variable', 'Output Type', 'Algorithms', 'Evaluation Metrics', 'Applications'],
'Regression': [
'Continuous numerical values',
'Real numbers (∞ possibilities)',
'Linear, Polynomial, Ridge, Lasso',
'MSE, RMSE, MAE, R²',
'Price prediction, forecasting'
],
'Classification': [
'Discrete categories/classes',
'Limited set of classes',
'Logistic, SVM, Decision Trees',
'Accuracy, Precision, Recall, F1',
'Spam detection, image recognition'
]
}
comparison_df = pd.DataFrame(comparison_data)
print("Regression vs Classification Comparison:")
print("=" * 60)
for _, row in comparison_df.iterrows():
print(f"{row['Aspect']:<20}: {row['Regression']:<35} | {row['Classification']}")Linear regression forms the theoretical backbone of statistical learning, providing both optimal parameter estimation through least squares and a complete probabilistic framework for inference about relationships between variables.
The Linear Model Framework
The population linear regression model assumes:
Y = Xβ + ε
Where:
- Y ∈ ℝⁿ is the response vector
- X ∈ ℝⁿˣᵖ is the design matrix (includes intercept column)
- β ∈ ℝᵖ is the parameter vector
- ε ~ N(0, σ²I) is the error vector (iid Gaussian noise)
For the single-predictor case:
yᵢ = β₀ + β₁xᵢ + εᵢ, εᵢ ~ N(0, σ²)
Least Squares Principle
The method of least squares minimizes the residual sum of squares:
RSS(β) = Σᵢ₌₁ⁿ (yᵢ - β₀ - β₁xᵢ)²
Analytical Solution via Calculus
Setting partial derivatives to zero:
- ∂RSS/∂β₀ = 0 ⟹ β̂₀ = ȳ - β̂₁x̄
- ∂RSS/∂β₁ = 0 ⟹ β̂₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
Statistical Properties of Estimators
Under the Gauss-Markov conditions, the OLS estimators are:
1. Unbiased: E[β̂] = β
2. Consistent: β̂ → β as n → ∞
3. BLUE: Best Linear Unbiased Estimators (minimum variance among all linear unbiased estimators)
4. Asymptotically Normal: β̂ ~ N(β, σ²(XᵀX)⁻¹) for large n
class SimpleLinearRegression:
def __init__(self):
self.slope = None
self.intercept = None
def fit(self, X, y):
"""Fit the linear regression model"""
# Convert to numpy arrays
X = np.array(X).flatten()
y = np.array(y).flatten()
# Calculate means
x_mean = np.mean(X)
y_mean = np.mean(y)
# Calculate slope and intercept
numerator = np.sum((X - x_mean) * (y - y_mean))
denominator = np.sum((X - x_mean) ** 2)
self.slope = numerator / denominator
self.intercept = y_mean - self.slope * x_mean
return self
def predict(self, X):
"""Make predictions"""
X = np.array(X).flatten()
return self.intercept + self.slope * X
def score(self, X, y):
"""Calculate R-squared"""
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return 1 - (ss_res / ss_tot)
# Generate sample data
np.random.seed(42)
X_simple = np.random.randn(100)
y_simple = 2 + 3 * X_simple + np.random.randn(100) * 0.5
# Fit custom implementation
slr_custom = SimpleLinearRegression()
slr_custom.fit(X_simple, y_simple)
# Compare with sklearn
from sklearn.linear_model import LinearRegression
slr_sklearn = LinearRegression()
slr_sklearn.fit(X_simple.reshape(-1, 1), y_simple)
print("Simple Linear Regression Comparison:")
print("=" * 40)
print(f"Custom Implementation:")
print(f" Slope: {slr_custom.slope:.4f}")
print(f" Intercept: {slr_custom.intercept:.4f}")
print(f" R²: {slr_custom.score(X_simple, y_simple):.4f}")
print(f"\nScikit-learn:")
print(f" Slope: {slr_sklearn.coef_[0]:.4f}")
print(f" Intercept: {slr_sklearn.intercept_:.4f}")
print(f" R²: {slr_sklearn.score(X_simple.reshape(-1, 1), y_simple):.4f}")
# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_simple, y_simple, alpha=0.6, label='Data Points')
plt.plot(X_simple, slr_custom.predict(X_simple), color='red', linewidth=2, label='Fitted Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.grid(True, alpha=0.3)
# Residual plot
plt.subplot(1, 2, 2)
residuals = y_simple - slr_custom.predict(X_simple)
plt.scatter(slr_custom.predict(X_simple), residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Multiple linear regression extends to multivariate predictors using matrix algebra for elegant mathematical treatment and computational efficiency.
Matrix Formulation
The multiple regression model in matrix notation:
y = Xβ + ε
Where:
- y ∈ ℝⁿ is the response vector
- X ∈ ℝⁿˣ⁽ᵖ⁺¹⁾ is the design matrix [1 x₁ x₂ … xₚ]
- β ∈ ℝᵖ⁺¹ is the parameter vector [β₀ β₁ … βₚ]ᵀ
- ε ∈ ℝⁿ is the error vector
Least Squares Optimization
Minimizing the quadratic loss function:
RSS(β) = ||y - Xβ||² = (y - Xβ)ᵀ(y - Xβ)
Normal Equations Derivation
Taking the gradient with respect to β:
**∇_β RSS = -2Xᵀy + 2XᵀXβ = 0**
Closed-Form Solution:
β̂ = (XᵀX)⁻¹Xᵀy
Geometric Interpretation
The OLS solution projects y onto the column space of X:
- Projection: ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy, where H is the “hat” matrix
- Orthogonality: the residuals e = y - ŷ are orthogonal to the column space of X
- ŷ is the closest point in Col(X) to y
Computational Considerations

In practice, explicitly forming (XᵀX)⁻¹ is numerically unstable when features are highly correlated; library implementations instead solve the least-squares problem via QR or SVD factorizations.
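As a sanity check, the sketch below (synthetic data; coefficient values are illustrative) solves the normal equations directly, using np.linalg.solve rather than an explicit inverse, and compares the result with scikit-learn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 4.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

X_design = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
# β̂ = (XᵀX)⁻¹Xᵀy, computed as a linear solve for numerical stability
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

sk = LinearRegression().fit(X, y)
print("Normal equations:", beta_hat.round(4))
print("scikit-learn:    ", np.r_[sk.intercept_, sk.coef_].round(4))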
# Load and explore Boston Housing dataset
from sklearn.datasets import fetch_california_housing
import warnings
warnings.filterwarnings('ignore')
# Note: Boston housing dataset is deprecated, using California housing instead
housing = fetch_california_housing()
X_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
y_housing = housing.target
print("California Housing Dataset:")
print("=" * 30)
print(f"Shape: {X_housing.shape}")
print(f"Features: {list(X_housing.columns)}")
print(f"Target: House prices in hundreds of thousands of dollars")
# Display basic statistics
print("\nDataset Statistics:")
print(X_housing.describe())
# Correlation analysis
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
correlation_matrix = X_housing.corrwith(pd.Series(y_housing))
correlation_matrix.plot(kind='bar')
plt.title('Feature Correlation with Price')
plt.ylabel('Correlation')
plt.xticks(rotation=45)
plt.subplot(1, 3, 2)
plt.scatter(X_housing['MedInc'], y_housing, alpha=0.5)
plt.xlabel('Median Income')
plt.ylabel('House Price')
plt.title('Income vs Price')
plt.subplot(1, 3, 3)
plt.scatter(X_housing['AveRooms'], y_housing, alpha=0.5)
plt.xlabel('Average Rooms')
plt.ylabel('House Price')
plt.title('Rooms vs Price')
plt.tight_layout()
plt.show()
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X_housing, y_housing, test_size=0.2, random_state=42
)
# Train multiple linear regression
mlr = LinearRegression()
mlr.fit(X_train, y_train)
# Make predictions
y_pred_train = mlr.predict(X_train)
y_pred_test = mlr.predict(X_test)
# Evaluate performance
from sklearn.metrics import mean_absolute_error, mean_squared_error
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
print("\nMultiple Linear Regression Results:")
print("=" * 40)
print(f"Training MSE: {train_mse:.4f}")
print(f"Testing MSE: {test_mse:.4f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Testing R²: {test_r2:.4f}")
# Feature importance (coefficients)
feature_importance = pd.DataFrame({
'Feature': X_housing.columns,
'Coefficient': mlr.coef_
})
feature_importance = feature_importance.sort_values('Coefficient', key=abs, ascending=False)
print(f"\nFeature Importance (Coefficients):")
print(feature_importance)
# Visualization of results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred_test, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title(f'Actual vs Predicted\nR² = {test_r2:.3f}')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
feature_importance.plot(x='Feature', y='Coefficient', kind='bar', ax=plt.gca())
plt.title('Feature Coefficients')
plt.ylabel('Coefficient Value')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Linear regression makes several important assumptions:
def check_regression_assumptions(X, y, model, model_name="Linear Regression"):
"""
Check linear regression assumptions with visualizations
"""
y_pred = model.predict(X)
residuals = y - y_pred
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Linearity check (Actual vs Predicted)
axes[0,0].scatter(y, y_pred, alpha=0.6)
axes[0,0].plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=2)
axes[0,0].set_xlabel('Actual Values')
axes[0,0].set_ylabel('Predicted Values')
axes[0,0].set_title('Linearity Check: Actual vs Predicted')
axes[0,0].grid(True, alpha=0.3)
# 2. Homoscedasticity check (Residuals vs Fitted)
axes[0,1].scatter(y_pred, residuals, alpha=0.6)
axes[0,1].axhline(y=0, color='red', linestyle='--')
axes[0,1].set_xlabel('Fitted Values')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Homoscedasticity Check: Residuals vs Fitted')
axes[0,1].grid(True, alpha=0.3)
# 3. Normality check (Q-Q plot)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1,0])
axes[1,0].set_title('Normality Check: Q-Q Plot')
axes[1,0].grid(True, alpha=0.3)
# 4. Residual distribution
axes[1,1].hist(residuals, bins=30, alpha=0.7, edgecolor='black')
axes[1,1].set_xlabel('Residuals')
axes[1,1].set_ylabel('Frequency')
axes[1,1].set_title('Residual Distribution')
axes[1,1].grid(True, alpha=0.3)
plt.suptitle(f'{model_name} - Assumption Checks', fontsize=16)
plt.tight_layout()
plt.show()
# Statistical tests
print(f"\n{model_name} - Assumption Test Results:")
print("=" * 50)
# Shapiro-Wilk test for normality
shapiro_stat, shapiro_p = stats.shapiro(residuals[:5000]) # Limit for large datasets
print(f"Shapiro-Wilk Test (Normality):")
print(f" Statistic: {shapiro_stat:.4f}, p-value: {shapiro_p:.4f}")
print(f" Normal residuals: {'Yes' if shapiro_p > 0.05 else 'No'}")
# Breusch-Pagan test for homoscedasticity (simplified)
mean_residual_sq = np.mean(residuals**2)
print(f"\nResidual Analysis:")
print(f" Mean squared residual: {mean_residual_sq:.4f}")
print(f" Standard deviation: {np.std(residuals):.4f}")
# Check assumptions for our housing model
check_regression_assumptions(X_test, y_test, mlr, "Multiple Linear Regression")

When the relationship between variables is non-linear, polynomial regression can capture curved patterns by adding polynomial terms.
Polynomial regression extends linear regression by including polynomial terms:
y = β₀ + β₁x + β₂x² + β₃x³ + … + β_d x^d + ε

This is still a linear model in terms of the coefficients βᵢ, but non-linear in terms of the features.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import validation_curve
# Generate non-linear data
np.random.seed(42)
X_poly_demo = np.linspace(-2, 2, 100).reshape(-1, 1)
y_poly_demo = 2 + 3*X_poly_demo.ravel() - 1.5*X_poly_demo.ravel()**2 + 0.5*X_poly_demo.ravel()**3 + np.random.normal(0, 0.5, 100)
# Compare different polynomial degrees
degrees = [1, 2, 3, 4, 6, 8]
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()
X_plot = np.linspace(-2, 2, 300).reshape(-1, 1)
for i, degree in enumerate(degrees):
# Create polynomial pipeline
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=degree)),
('linear', LinearRegression())
])
# Fit model
poly_pipeline.fit(X_poly_demo, y_poly_demo)
y_plot = poly_pipeline.predict(X_plot)
# Plot
axes[i].scatter(X_poly_demo, y_poly_demo, alpha=0.6, label='Data')
axes[i].plot(X_plot, y_plot, color='red', linewidth=2, label=f'Degree {degree}')
axes[i].set_title(f'Polynomial Degree {degree}')
axes[i].set_xlabel('X')
axes[i].set_ylabel('y')
axes[i].legend()
axes[i].grid(True, alpha=0.3)
# Calculate R²
r2 = poly_pipeline.score(X_poly_demo, y_poly_demo)
axes[i].text(0.05, 0.95, f'R² = {r2:.3f}', transform=axes[i].transAxes,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
plt.tight_layout()
plt.show()
# Validation curve to find optimal degree
degrees_range = range(1, 16)
train_scores, val_scores = validation_curve(
Pipeline([('poly', PolynomialFeatures()), ('linear', LinearRegression())]),
X_poly_demo, y_poly_demo,
param_name='poly__degree', param_range=degrees_range,
cv=5, scoring='neg_mean_squared_error'
)
# Convert to positive MSE
train_mse = -train_scores
val_mse = -val_scores
plt.figure(figsize=(10, 6))
plt.plot(degrees_range, np.mean(train_mse, axis=1), 'o-', color='blue', label='Training MSE')
plt.plot(degrees_range, np.mean(val_mse, axis=1), 'o-', color='red', label='Validation MSE')
plt.fill_between(degrees_range,
np.mean(train_mse, axis=1) - np.std(train_mse, axis=1),
np.mean(train_mse, axis=1) + np.std(train_mse, axis=1),
color='blue', alpha=0.1)
plt.fill_between(degrees_range,
np.mean(val_mse, axis=1) - np.std(val_mse, axis=1),
np.mean(val_mse, axis=1) + np.std(val_mse, axis=1),
color='red', alpha=0.1)
plt.xlabel('Polynomial Degree')
plt.ylabel('Mean Squared Error')
plt.title('Bias-Variance Tradeoff in Polynomial Regression')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
optimal_degree = degrees_range[np.argmin(np.mean(val_mse, axis=1))]
print(f"Optimal polynomial degree: {optimal_degree}")Regularization prevents overfitting by adding a penalty term to the cost function, constraining the model complexity.
Ridge regression addresses the fundamental challenge of overfitting by introducing a bias-variance trade-off through L2 regularization, providing both theoretical guarantees and practical benefits for high-dimensional problems.
Regularized Optimization Problem
Ridge regression modifies the OLS objective with an L2 penalty term:
minimize_β ||y - Xβ||² + λ||β||²
Where:
- λ > 0 is the regularization parameter
- ||β||² = Σⱼ βⱼ² is the squared L2 norm of the coefficients
Closed-Form Solution
The regularized normal equations yield:
**β̂_ridge = (XᵀX + λI)⁻¹Xᵀy**
Key Mathematical Properties:
1. Invertibility: XᵀX + λI is positive definite for any λ > 0, so the solution exists even when XᵀX is singular (e.g., p > n or collinear features)
2. Shrinkage: coefficients shrink smoothly toward zero as λ grows, but are never set exactly to zero
3. Uniqueness: the solution is unique and varies continuously with λ
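The closed form is easy to verify numerically. The sketch below (synthetic data, centered to sidestep the intercept; the λ value is illustrative) compares it against scikit-learn's Ridge:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.1, 50)
X_c, y_c = X - X.mean(axis=0), y - y.mean()  # center so no intercept is needed
lam = 1.0

# β̂_ridge = (XᵀX + λI)⁻¹Xᵀy
beta_ridge = np.linalg.solve(X_c.T @ X_c + lam * np.eye(4), X_c.T @ y_c)
sk = Ridge(alpha=lam, fit_intercept=False).fit(X_c, y_c)
print("Closed form: ", beta_ridge.round(4))
print("scikit-learn:", sk.coef_.round(4))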
Bayesian Interpretation
Ridge regression corresponds to MAP estimation with Gaussian priors:
β ~ N(0, σ²/λ I)
The regularization parameter λ = σ²/τ² where τ² is the prior variance.
Bias-Variance Decomposition
Ridge regression introduces bias to reduce variance:
- Bias increases: E[β̂_ridge] ≠ β (shrinkage introduces bias)
- Variance decreases: Var[β̂_ridge] < Var[β̂_OLS] (regularization reduces variance)
- MSE optimum: the optimal λ minimizes Bias² + Variance
Effective Degrees of Freedom
Ridge regression’s model complexity can be quantified as:
df(λ) = trace(X(XᵀX + λI)⁻¹Xᵀ) = Σᵢ σᵢ²/(σᵢ² + λ)
Where the σᵢ² are the eigenvalues of XᵀX. As λ → 0, df → p (OLS); as λ → ∞, df → 0.
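This quantity can be computed directly from the singular values of X, as in the sketch below (synthetic data with p = 5; the λ grid is illustrative):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
eig = np.linalg.svd(X, compute_uv=False) ** 2  # σᵢ², the eigenvalues of XᵀX

for lam in [0.0, 1.0, 10.0, 1000.0]:
    df = np.sum(eig / (eig + lam))
    print(f"lambda = {lam:>7}: df = {df:.3f}")  # λ = 0 gives df = p = 5; large λ shrinks df toward 0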
from sklearn.linear_model import Ridge, RidgeCV
# Ridge regression implementation and comparison
ridge_alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
# Split housing data for regularization demo
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
X_housing, y_housing, test_size=0.3, random_state=42
)
# Standardize features (important for regularization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_reg)
X_test_scaled = scaler.transform(X_test_reg)
ridge_results = {}
print("Ridge Regression Results:")
print("=" * 50)
for alpha in ridge_alphas:
ridge = Ridge(alpha=alpha)
ridge.fit(X_train_scaled, y_train_reg)
train_score = ridge.score(X_train_scaled, y_train_reg)
test_score = ridge.score(X_test_scaled, y_test_reg)
ridge_results[alpha] = {
'train_r2': train_score,
'test_r2': test_score,
'coefficients': ridge.coef_
}
print(f"Alpha {alpha:6.3f}: Train R² = {train_score:.4f}, Test R² = {test_score:.4f}")
# Cross-validation to find optimal alpha
ridge_cv = RidgeCV(alphas=ridge_alphas, cv=5)
ridge_cv.fit(X_train_scaled, y_train_reg)
print(f"\nOptimal Alpha (CV): {ridge_cv.alpha_}")
print(f"CV Score: {ridge_cv.score(X_test_scaled, y_test_reg):.4f}")
# Visualize coefficient paths
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
for i, feature in enumerate(X_housing.columns):
coef_path = [ridge_results[alpha]['coefficients'][i] for alpha in ridge_alphas]
plt.plot(ridge_alphas, coef_path, 'o-', label=feature)
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficient Value')
plt.title('Ridge Regression Coefficient Paths')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
train_scores = [ridge_results[alpha]['train_r2'] for alpha in ridge_alphas]
test_scores = [ridge_results[alpha]['test_r2'] for alpha in ridge_alphas]
plt.plot(ridge_alphas, train_scores, 'o-', label='Train R²')
plt.plot(ridge_alphas, test_scores, 'o-', label='Test R²')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('R² Score')
plt.title('Ridge Regression Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
# Compare coefficients: Linear vs Ridge
linear_coefs = LinearRegression().fit(X_train_scaled, y_train_reg).coef_
ridge_coefs = ridge_cv.coef_
x_pos = np.arange(len(X_housing.columns))
width = 0.35
plt.bar(x_pos - width/2, linear_coefs, width, label='Linear Regression', alpha=0.7)
plt.bar(x_pos + width/2, ridge_coefs, width, label='Ridge Regression', alpha=0.7)
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.title('Coefficient Comparison')
plt.xticks(x_pos, X_housing.columns, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Lasso regression uses L1 regularization, which can drive coefficients to exactly zero:
$$J(\boldsymbol{\beta}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}|\beta_j|$$
This enables automatic feature selection.
from sklearn.linear_model import Lasso, LassoCV
# Lasso regression analysis
lasso_alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
lasso_results = {}
print("\nLasso Regression Results:")
print("=" * 50)
for alpha in lasso_alphas:
lasso = Lasso(alpha=alpha, max_iter=2000)
lasso.fit(X_train_scaled, y_train_reg)
train_score = lasso.score(X_train_scaled, y_train_reg)
test_score = lasso.score(X_test_scaled, y_test_reg)
# Count non-zero coefficients
non_zero_coefs = np.sum(lasso.coef_ != 0)
lasso_results[alpha] = {
'train_r2': train_score,
'test_r2': test_score,
'coefficients': lasso.coef_,
'non_zero_coefs': non_zero_coefs
}
print(f"Alpha {alpha:6.3f}: Train R² = {train_score:.4f}, Test R² = {test_score:.4f}, Features: {non_zero_coefs}")
# Cross-validation for optimal alpha
lasso_cv = LassoCV(alphas=lasso_alphas, cv=5, max_iter=2000)
lasso_cv.fit(X_train_scaled, y_train_reg)
print(f"\nOptimal Alpha (CV): {lasso_cv.alpha_:.4f}")
print(f"CV Score: {lasso_cv.score(X_test_scaled, y_test_reg):.4f}")
# Feature selection analysis
selected_features = X_housing.columns[lasso_cv.coef_ != 0]
print(f"\nSelected Features by Lasso: {len(selected_features)} out of {len(X_housing.columns)}")
for feature, coef in zip(X_housing.columns, lasso_cv.coef_):
if coef != 0:
print(f" {feature}: {coef:.4f}")
# Visualize Lasso paths
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
for i, feature in enumerate(X_housing.columns):
coef_path = [lasso_results[alpha]['coefficients'][i] for alpha in lasso_alphas]
plt.plot(lasso_alphas, coef_path, 'o-', label=feature)
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Coefficient Value')
plt.title('Lasso Regression Coefficient Paths')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
train_scores_lasso = [lasso_results[alpha]['train_r2'] for alpha in lasso_alphas]
test_scores_lasso = [lasso_results[alpha]['test_r2'] for alpha in lasso_alphas]
plt.plot(lasso_alphas, train_scores_lasso, 'o-', label='Train R²')
plt.plot(lasso_alphas, test_scores_lasso, 'o-', label='Test R²')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('R² Score')
plt.title('Lasso Regression Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
num_features = [lasso_results[alpha]['non_zero_coefs'] for alpha in lasso_alphas]
plt.plot(lasso_alphas, num_features, 'o-', color='green')
plt.xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Number of Selected Features')
plt.title('Feature Selection by Lasso')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Elastic Net combines both L1 and L2 regularization:
$$J(\boldsymbol{\beta}) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2 + \lambda_1\sum_{j=1}^{n}|\beta_j| + \lambda_2\sum_{j=1}^{n}\beta_j^2$$
This provides a balance between Ridge’s stability and Lasso’s feature selection.
from sklearn.linear_model import ElasticNet, ElasticNetCV
# Elastic Net with different l1_ratio values
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9] # 0 = Ridge, 1 = Lasso
elastic_results = {}
print("Elastic Net Results:")
print("=" * 40)
for l1_ratio in l1_ratios:
elastic_cv = ElasticNetCV(l1_ratio=l1_ratio, cv=5, max_iter=2000)
elastic_cv.fit(X_train_scaled, y_train_reg)
test_score = elastic_cv.score(X_test_scaled, y_test_reg)
non_zero_coefs = np.sum(elastic_cv.coef_ != 0)
elastic_results[l1_ratio] = {
'test_r2': test_score,
'alpha': elastic_cv.alpha_,
'non_zero_coefs': non_zero_coefs
}
print(f"L1 ratio {l1_ratio:.1f}: R² = {test_score:.4f}, Alpha = {elastic_cv.alpha_:.4f}, Features = {non_zero_coefs}")
# Find best l1_ratio
best_l1_ratio = max(elastic_results.keys(), key=lambda k: elastic_results[k]['test_r2'])
print(f"\nBest L1 ratio: {best_l1_ratio} (R² = {elastic_results[best_l1_ratio]['test_r2']:.4f})")Proper evaluation is crucial for understanding regression model performance and comparing different approaches.
Regression evaluation metrics quantify different aspects of prediction quality, each with specific mathematical properties and interpretations rooted in statistical theory.
Loss Function Perspective
Different metrics correspond to different loss functions being optimized:
Mean Absolute Error (L1 Loss): MAE = (1/n) Σᵢ|yᵢ - ŷᵢ|
Mean Squared Error (L2 Loss): MSE = (1/n) Σᵢ(yᵢ - ŷᵢ)²
Root Mean Squared Error: RMSE = √MSE
R-squared: Coefficient of Determination
R² = 1 - SSres/SStot = 1 - Σ(yᵢ - ŷᵢ)²/Σ(yᵢ - ȳ)²
Statistical Interpretation: - Proportion of variance explained by the model - Range: (-∞, 1], where 1 = perfect fit, 0 = no better than mean - Baseline comparison: Compares model against naive mean predictor
Adjusted R-squared: Complexity Penalty
R²adj = 1 - (1 - R²)(n-1)/(n-p-1)
Purpose: Penalizes model complexity to prevent overfitting - Decreases when adding irrelevant features (even if R² increases) - Model selection: Favors parsimonious models - Degrees of freedom: Accounts for parameters used
Information-Theoretic Metrics
Akaike Information Criterion (AIC): AIC = 2p - 2ln(L) ≈ n ln(MSE) + 2p
Bayesian Information Criterion (BIC)
BIC = p ln(n) - 2ln(L) ≈ n ln(MSE) + p ln(n)
Both penalize complexity but BIC more heavily for large n.
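Scikit-learn does not expose these criteria as standalone metrics, but both follow directly from the Gaussian-likelihood approximations above. A minimal sketch (the helper name information_criteria is an assumption for illustration):
import numpy as np
from sklearn.metrics import mean_squared_error
def information_criteria(y_true, y_pred, n_params):
    """AIC/BIC via the approximations AIC ≈ n ln(MSE) + 2p, BIC ≈ n ln(MSE) + p ln(n)."""
    n = len(y_true)
    mse = mean_squared_error(y_true, y_pred)
    aic = n * np.log(mse) + 2 * n_params
    bic = n * np.log(mse) + n_params * np.log(n)
    return aic, bic
Lower values are better for both; comparing the two criteria across candidate models shows BIC's stronger preference for small models as n grows.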
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def comprehensive_regression_evaluation(models_dict, X_train, X_test, y_train, y_test):
"""
Comprehensive evaluation of multiple regression models
"""
results = {}
print("Comprehensive Model Evaluation:")
print("=" * 80)
print(f"{'Model':<20} {'MAE':<8} {'MSE':<8} {'RMSE':<8} {'R²':<8} {'Adj R²':<8}")
print("-" * 80)
for name, model in models_dict.items():
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Calculate metrics
mae = mean_absolute_error(y_test, y_pred_test)
mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_test)
# Adjusted R²
n = len(y_test)
p = X_test.shape[1] if hasattr(X_test, 'shape') else len(X_test[0])
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
results[name] = {
'MAE': mae,
'MSE': mse,
'RMSE': rmse,
'R²': r2,
'Adjusted R²': adj_r2,
'predictions': y_pred_test
}
print(f"{name:<20} {mae:<8.4f} {mse:<8.4f} {rmse:<8.4f} {r2:<8.4f} {adj_r2:<8.4f}")
return results
# Prepare models for comparison
models_comparison = {
'Linear Regression': LinearRegression().fit(X_train_scaled, y_train_reg),
'Ridge (CV)': ridge_cv,
'Lasso (CV)': lasso_cv,
'Elastic Net': ElasticNetCV(cv=5, max_iter=2000).fit(X_train_scaled, y_train_reg),
'Polynomial (deg=2)': Pipeline([
('poly', PolynomialFeatures(2)),
('linear', LinearRegression())
]).fit(X_train_scaled, y_train_reg)
}
# Evaluate all models
evaluation_results = comprehensive_regression_evaluation(
models_comparison, X_train_scaled, X_test_scaled, y_train_reg, y_test_reg
)
# Visualize model performance
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Metrics comparison
metrics = ['MAE', 'MSE', 'RMSE', 'R²']
for i, metric in enumerate(metrics):
ax = axes[i//3, i%3]  # first row: MAE, MSE, RMSE; second row starts with R²
values = [evaluation_results[model][metric] for model in models_comparison.keys()]
bars = ax.bar(range(len(values)), values)
ax.set_title(f'{metric} Comparison')
ax.set_ylabel(metric)
ax.set_xticks(range(len(values)))
ax.set_xticklabels(models_comparison.keys(), rotation=45)
# Add value labels on bars
for j, bar in enumerate(bars):
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2., height,
f'{height:.3f}', ha='center', va='bottom', fontsize=8)
# Actual vs Predicted for best model
best_model_name = max(evaluation_results.keys(), key=lambda k: evaluation_results[k]['R²'])
ax = axes[1, 2]
best_predictions = evaluation_results[best_model_name]['predictions']
ax.scatter(y_test_reg, best_predictions, alpha=0.6)
ax.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'r--', linewidth=2)
ax.set_xlabel('Actual Values')
ax.set_ylabel('Predicted Values')
ax.set_title(f'Best Model: {best_model_name}\nR² = {evaluation_results[best_model_name]["R²"]:.4f}')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nBest performing model: {best_model_name}")from sklearn.model_selection import cross_val_score, KFold
def regression_cross_validation(models_dict, X, y, cv_folds=5):
"""
Perform cross-validation for regression models
"""
kfold = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
cv_results = {}
print(f"{cv_folds}-Fold Cross-Validation Results:")
print("=" * 60)
print(f"{'Model':<20} {'Mean R²':<10} {'Std R²':<10} {'Mean RMSE':<12} {'Std RMSE':<10}")
print("-" * 60)
for name, model in models_dict.items():
# R² scores
r2_scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
# RMSE scores (note: sklearn returns negative MSE, so we need to convert)
neg_mse_scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-neg_mse_scores)
cv_results[name] = {
'r2_mean': r2_scores.mean(),
'r2_std': r2_scores.std(),
'rmse_mean': rmse_scores.mean(),
'rmse_std': rmse_scores.std()
}
print(f"{name:<20} {r2_scores.mean():<10.4f} {r2_scores.std():<10.4f} "
f"{rmse_scores.mean():<12.4f} {rmse_scores.std():<10.4f}")
return cv_results
# Perform cross-validation
cv_results = regression_cross_validation(models_comparison, X_train_scaled, y_train_reg)
# Visualize CV results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# R² comparison with error bars
model_names = list(cv_results.keys())
r2_means = [cv_results[name]['r2_mean'] for name in model_names]
r2_stds = [cv_results[name]['r2_std'] for name in model_names]
ax1.bar(range(len(model_names)), r2_means, yerr=r2_stds, capsize=5, alpha=0.7)
ax1.set_title('Cross-Validation R² Scores')
ax1.set_ylabel('R² Score')
ax1.set_xticks(range(len(model_names)))
ax1.set_xticklabels(model_names, rotation=45)
ax1.grid(True, alpha=0.3)
# RMSE comparison with error bars
rmse_means = [cv_results[name]['rmse_mean'] for name in model_names]
rmse_stds = [cv_results[name]['rmse_std'] for name in model_names]
ax2.bar(range(len(model_names)), rmse_means, yerr=rmse_stds, capsize=5, alpha=0.7, color='orange')
ax2.set_title('Cross-Validation RMSE Scores')
ax2.set_ylabel('RMSE')
ax2.set_xticks(range(len(model_names)))
ax2.set_xticklabels(model_names, rotation=45)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
from sklearn.model_selection import learning_curve
def plot_learning_curve(model, X, y, title="Learning Curve"):
"""
Plot learning curve to diagnose bias/variance
"""
train_sizes, train_scores, val_scores = learning_curve(
model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
scoring='r2', n_jobs=-1
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
color='blue', alpha=0.1)
plt.plot(train_sizes, val_mean, 'o-', color='red', label='Validation Score')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
color='red', alpha=0.1)
plt.xlabel('Training Set Size')
plt.ylabel('R² Score')
plt.title(title)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Diagnosis
final_gap = train_mean[-1] - val_mean[-1]
if final_gap > 0.1:
print("⚠️ High variance (overfitting) detected")
print(" Consider: more data, regularization, simpler model")
elif val_mean[-1] < 0.6:
print("⚠️ High bias (underfitting) detected")
print(" Consider: more complex model, more features")
else:
print("✅ Model appears well-fitted")
# Plot learning curves for different models
for name, model in [('Linear Regression', LinearRegression()),
('Ridge Regression', Ridge(alpha=1.0)),
('Lasso Regression', Lasso(alpha=0.1))]:
print(f"\nLearning Curve Analysis: {name}")
plot_learning_curve(model, X_train_scaled, y_train_reg, f"Learning Curve - {name}")
Let’s apply regression techniques to a real-world sales forecasting problem.
# Create a realistic sales dataset
np.random.seed(42)
n_samples = 1000
# Generate features
advertising_spend = np.random.exponential(10, n_samples)
season = np.random.randint(1, 5, n_samples) # 1-4 representing quarters
price = np.random.normal(100, 20, n_samples)
competitor_price = price + np.random.normal(0, 10, n_samples)
economic_index = np.random.normal(100, 15, n_samples)
# Generate target with realistic relationships
sales = (
50 + # Base sales
2.5 * advertising_spend + # Advertising effect
np.where(season == 4, 20, 0) + # Holiday season boost
-0.8 * price + # Price sensitivity
0.3 * competitor_price + # Competitor effect
0.1 * economic_index + # Economic conditions
np.random.normal(0, 10, n_samples) # Noise
)
# Create DataFrame
sales_data = pd.DataFrame({
'advertising_spend': advertising_spend,
'season': season,
'price': price,
'competitor_price': competitor_price,
'economic_index': economic_index,
'sales': sales
})
print("Sales Forecasting Dataset:")
print("=" * 30)
print(sales_data.describe())
# Feature engineering
sales_data['price_difference'] = sales_data['competitor_price'] - sales_data['price']
sales_data['advertising_per_price'] = sales_data['advertising_spend'] / sales_data['price']
# One-hot encode season
sales_encoded = pd.get_dummies(sales_data, columns=['season'], prefix='season')
# Prepare features and target
X_sales = sales_encoded.drop('sales', axis=1)
y_sales = sales_encoded['sales']
# Split data
X_train_sales, X_test_sales, y_train_sales, y_test_sales = train_test_split(
X_sales, y_sales, test_size=0.2, random_state=42
)
# Scale features
scaler_sales = StandardScaler()
X_train_sales_scaled = scaler_sales.fit_transform(X_train_sales)
X_test_sales_scaled = scaler_sales.transform(X_test_sales)
# Apply different regression models
sales_models = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'Polynomial (degree=2)': Pipeline([
('poly', PolynomialFeatures(2, interaction_only=True)),
('linear', LinearRegression())
])
}
sales_results = {}
print("\nSales Forecasting Model Comparison:")
print("=" * 50)
for name, model in sales_models.items():
# Fit model
if name == 'Polynomial (degree=2)':
model.fit(X_train_sales, y_train_sales) # Don't scale for polynomial
y_pred = model.predict(X_test_sales)
score = r2_score(y_test_sales, y_pred)
else:
model.fit(X_train_sales_scaled, y_train_sales)
y_pred = model.predict(X_test_sales_scaled)
score = model.score(X_test_sales_scaled, y_test_sales)
mae = mean_absolute_error(y_test_sales, y_pred)
rmse = np.sqrt(mean_squared_error(y_test_sales, y_pred))
sales_results[name] = {
'R²': score,
'MAE': mae,
'RMSE': rmse,
'predictions': y_pred
}
print(f"{name:<25}: R² = {score:.4f}, MAE = {mae:.2f}, RMSE = {rmse:.2f}")
# Business insights from best model
best_sales_model = max(sales_results.keys(), key=lambda k: sales_results[k]['R²'])
print(f"\nBest model: {best_sales_model}")
# Feature importance analysis (for linear model)
if best_sales_model == 'Linear Regression':
lr_sales = LinearRegression().fit(X_train_sales_scaled, y_train_sales)
feature_importance = pd.DataFrame({
'Feature': X_sales.columns,
'Coefficient': lr_sales.coef_
}).sort_values('Coefficient', key=abs, ascending=False)
print("\nFeature Importance (Business Insights):")
print("-" * 40)
for _, row in feature_importance.iterrows():
direction = "increases" if row['Coefficient'] > 0 else "decreases"
print(f"✅ {row['Feature']:<20}: {direction} sales by {abs(row['Coefficient']):.2f} units per unit change")
# Visualize results
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
# Model performance comparison
models_list = list(sales_results.keys())
r2_scores = [sales_results[model]['R²'] for model in models_list]
plt.bar(range(len(models_list)), r2_scores)
plt.title('Model Performance Comparison')
plt.ylabel('R² Score')
plt.xticks(range(len(models_list)), models_list, rotation=45)
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
# Best model predictions
best_pred = sales_results[best_sales_model]['predictions']
plt.scatter(y_test_sales, best_pred, alpha=0.6)
plt.plot([y_test_sales.min(), y_test_sales.max()], [y_test_sales.min(), y_test_sales.max()], 'r--')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title(f'{best_sales_model}\nActual vs Predicted')
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
# Residual analysis
residuals = y_test_sales - best_pred
plt.scatter(best_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted Sales')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Model persistence and deployment pipeline
import joblib
from datetime import datetime
class SalesForecaster:
"""
Production-ready sales forecasting model
"""
def __init__(self):
self.model = None
self.scaler = None
self.feature_names = None
self.model_version = None
def train(self, X_train, y_train, model_type='ridge'):
"""Train the forecasting model"""
self.feature_names = X_train.columns.tolist()
# Initialize scaler
self.scaler = StandardScaler()
X_train_scaled = self.scaler.fit_transform(X_train)
# Select and train model
if model_type == 'ridge':
self.model = Ridge(alpha=1.0)
elif model_type == 'lasso':
self.model = Lasso(alpha=0.1)
else:
self.model = LinearRegression()
self.model.fit(X_train_scaled, y_train)
self.model_version = datetime.now().strftime("%Y%m%d_%H%M%S")
print(f"Model trained successfully (version: {self.model_version})")
def predict(self, X_new):
"""Make predictions on new data"""
if self.model is None:
raise ValueError("Model must be trained before making predictions")
# Ensure feature consistency
X_new = X_new[self.feature_names]
# Scale features
X_new_scaled = self.scaler.transform(X_new)
# Make prediction
predictions = self.model.predict(X_new_scaled)
return predictions
def get_feature_importance(self):
"""Get feature importance for business insights"""
if self.model is None:
raise ValueError("Model must be trained first")
importance_df = pd.DataFrame({
'feature': self.feature_names,
'importance': np.abs(self.model.coef_)
}).sort_values('importance', ascending=False)
return importance_df
def save_model(self, filepath):
"""Save model to disk"""
model_data = {
'model': self.model,
'scaler': self.scaler,
'feature_names': self.feature_names,
'version': self.model_version
}
joblib.dump(model_data, filepath)
print(f"Model saved to {filepath}")
def load_model(self, filepath):
"""Load model from disk"""
model_data = joblib.load(filepath)
self.model = model_data['model']
self.scaler = model_data['scaler']
self.feature_names = model_data['feature_names']
self.model_version = model_data['version']
print(f"Model loaded (version: {self.model_version})")
# Example usage
forecaster = SalesForecaster()
forecaster.train(X_train_sales, y_train_sales, model_type='ridge')
# Make predictions on new data
sample_data = X_test_sales.head(5)
predictions = forecaster.predict(sample_data)
print("\nSample Predictions:")
print("=" * 30)
for i, (idx, row) in enumerate(sample_data.iterrows()):
print(f"Sample {i+1}: Predicted sales = ${predictions[i]:.2f}")
print(f" Advertising: ${row['advertising_spend']:.2f}")
print(f" Price: ${row['price']:.2f}")
print()
# Save model for deployment
# forecaster.save_model('sales_forecaster_model.pkl')
def regression_best_practices_checklist():
"""
Comprehensive checklist for regression projects
"""
checklist = {
"Data Preparation": [
""'� Check for missing values and handle appropriately",
""'� Identify and handle outliers",
""'� Examine feature distributions and transform if needed",
""'� Create meaningful feature interactions",
""'� Scale/normalize features for regularized models",
""'� Split data into train/validation/test sets"
],
"Model Selection": [
""'� Start with simple linear regression baseline",
""'� Try regularized models (Ridge/Lasso) if overfitting",
""'� Consider polynomial features for non-linear relationships",
""'� Use cross-validation for hyperparameter tuning",
""'� Compare multiple algorithms systematically"
],
"Evaluation": [
""'� Use appropriate metrics (R², MAE, RMSE)",
""'� Check model assumptions (linearity, homoscedasticity)",
""'� Analyze residuals for patterns",
""'� Perform cross-validation for robust estimates",
""'� Test on truly unseen data"
],
"Deployment": [
""'� Document model assumptions and limitations",
""'� Implement prediction confidence intervals",
""'� Set up model monitoring and retraining pipeline",
""'� Consider business constraints and interpretability needs",
""'� Plan for model updates as data changes"
]
}
print("Regression Best Practices Checklist:")
print("=" * 50)
for category, items in checklist.items():
print(f"\n{category}:")
for item in items:
print(f" {item}")
regression_best_practices_checklist()
| Pitfall | Problem | Solution |
|---|---|---|
| Data Leakage | Using future information to predict past | Ensure temporal ordering, proper train/test splits |
| Overfitting | Model too complex for available data | Use regularization, cross-validation, more data |
| Multicollinearity | Highly correlated features | Use VIF analysis, Ridge regression, PCA |
| Non-linearity | Linear model for non-linear relationships | Polynomial features, non-linear models |
| Heteroscedasticity | Non-constant error variance | Transform target variable, robust regression |
| Outliers | Extreme values affecting model | Robust regression, outlier detection/removal |
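As a small illustration of the VIF check mentioned in the multicollinearity row (a sketch assuming the statsmodels package is available; VIF above roughly 10 is a common rule of thumb for problematic collinearity):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif_table(X_df):
    """One VIF per feature; values above ~10 suggest problematic collinearity."""
    return pd.DataFrame({
        'feature': X_df.columns,
        'VIF': [variance_inflation_factor(X_df.values, i) for i in range(X_df.shape[1])]
    })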
1. Statistical Learning Foundation: Regression seeks to learn E[Y|X] through least squares optimization, providing both predictive power and statistical inference capabilities under appropriate distributional assumptions.
2. Matrix Algebra and Optimization: The normal equations β̂ = (XᵀX)⁻¹Xᵀy provide the closed-form OLS solution, with geometric interpretation as projection onto the column space of X.
3. Bias-Variance Trade-off Principles: - OLS: Unbiased but high variance in high dimensions - Ridge: Introduces bias to reduce variance via L2 regularization - Optimal λ: Minimizes total MSE through bias-variance balance
4. Statistical Properties of Estimators: Under Gauss-Markov conditions, OLS estimators are BLUE (Best Linear Unbiased Estimators) with well-defined asymptotic distributions for inference.
5. Regularization Theory: Ridge regression β̂ = (XᵀX + λI)⁻¹Xᵀy ensures invertibility and provides shrinkage, with a Bayesian interpretation as MAP estimation under Gaussian priors.
6. Loss Function Perspectives: Different metrics optimize different objectives (see the numeric check after this list): - MSE: Optimizes the conditional mean (L2 loss) - MAE: Optimizes the conditional median (L1 loss) - R²: Measures the proportion of variance explained
7. Model Selection Criteria: AIC and BIC provide principled approaches to model complexity selection through information-theoretic penalization of parameters.
8. Assumption Validation: Statistical validity requires checking linearity, independence, homoscedasticity, and normality assumptions through residual analysis and diagnostic tests.
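A quick numeric check of takeaway 6 (an illustrative sketch, not part of the exercises): among constant predictions, the sample mean minimizes squared error while the sample median minimizes absolute error.
import numpy as np
y = np.array([1, 2, 3, 4, 100])            # deliberately skewed sample
candidates = np.linspace(0, 100, 10001)    # candidate constant predictions
mse = [np.mean((y - c) ** 2) for c in candidates]
mae = [np.mean(np.abs(y - c)) for c in candidates]
print(f"MSE minimizer ≈ {candidates[np.argmin(mse)]:.2f} (mean = {y.mean():.2f})")
print(f"MAE minimizer ≈ {candidates[np.argmin(mae)]:.2f} (median = {np.median(y):.2f})")
The large outlier pulls the mean (and hence the MSE-optimal prediction) upward, while the median stays put, which is exactly why MAE is the more robust metric.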
Using the advertising dataset: 1. Implement simple linear regression from scratch 2. Compare with scikit-learn implementation 3. Analyze residuals and check assumptions 4. Calculate confidence intervals for predictions
With the Boston/California housing dataset: 1. Perform exploratory data analysis 2. Create polynomial and interaction features 3. Build multiple regression models 4. Interpret coefficients in business terms
Using a high-dimensional dataset: 1. Compare Ridge, Lasso, and Elastic Net 2. Use cross-validation for hyperparameter tuning 3. Analyze feature selection by Lasso 4. Evaluate bias-variance tradeoff
For any regression problem: 1. Implement all regression metrics from scratch 2. Create comprehensive residual analysis 3. Perform cross-validation with multiple metrics 4. Diagnose overfitting/underfitting using learning curves
Choose a real-world regression problem: 1. Data collection and preprocessing 2. Exploratory data analysis and feature engineering 3. Model selection and hyperparameter tuning 4. Evaluation and business interpretation 5. Deployment pipeline design
This completes Chapter 5: Regression Algorithms. You now have a comprehensive understanding of regression techniques, from simple linear regression to advanced regularization methods, along with proper evaluation and deployment practices.
# Chapter 6: Clustering Algorithms
CO5 - Apply unsupervised learning models
By the end of this chapter, students will be able to: - Understand the fundamentals of clustering and its applications - Implement K-Means clustering algorithm with proper parameter tuning - Apply hierarchical clustering techniques for data analysis - Use advanced clustering methods like DBSCAN and Gaussian Mixture Models - Evaluate clustering results using appropriate metrics - Visualize clustering outcomes for business insights
Clustering represents a fundamental challenge in unsupervised learning: discovering latent structure in data without explicit guidance. From a statistical perspective, clustering seeks to identify natural groupings that reflect the underlying data generating process.
Mathematical Framework of Clustering
Given a dataset X = {x₁, x₂, …, xₙ} where xᵢ ∈ ℝᵈ, clustering algorithms seek to partition the data into k clusters C = {C₁, C₂, …, Cₖ} such that each cluster is non-empty, clusters are pairwise disjoint (Cᵢ ∩ Cⱼ = ∅ for i ≠ j), and their union covers the dataset (∪ᵢ Cᵢ = X).
Information-Theoretic Perspective
Clustering can be viewed as data compression, where we replace individual data points with cluster representatives. The optimal clustering minimizes information loss while maximizing compression.
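A small illustrative sketch of this view (the data here is assumed for illustration): compress each point to its cluster representative and measure the resulting reconstruction error.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
X_compressed = km.cluster_centers_[km.labels_]   # each point replaced by its cluster representative
error = np.mean(np.sum((X_demo - X_compressed) ** 2, axis=1))
print(f"Mean reconstruction (quantization) error: {error:.3f}")
More clusters means less compression but a smaller reconstruction error; the trade-off between the two is the information-theoretic heart of clustering.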
Statistical Assumptions
Different clustering algorithms embody different assumptions about cluster structure: - Spherical clusters: K-means assumes clusters are spherical with similar sizes - Arbitrary shapes: DBSCAN can discover clusters of arbitrary density and shape - Probabilistic structure: Gaussian Mixture Models assume clusters follow multivariate Gaussian distributions
This chapter explores how these theoretical foundations translate into practical algorithms for discovering meaningful patterns in real-world data.
What you’ll learn: - Clustering fundamentals and evaluation metrics - K-Means algorithm implementation and optimization - Hierarchical clustering approaches and dendrograms - Advanced techniques: DBSCAN, Gaussian Mixture Models - Real-world applications and case studies - Visualization techniques for clustering results
Clustering is an unsupervised learning technique that groups similar data points together while separating dissimilar ones. The goal is to discover hidden structures in data without prior knowledge of group labels.
Key Characteristics: - Unsupervised: No target variable or labels provided - Exploratory: Discovers hidden patterns in data - Grouping: Creates meaningful segments or clusters - Similarity-based: Groups similar observations together
# Example: E-commerce customer clustering
customers_features = [
'annual_spending', 'purchase_frequency',
'avg_order_value', 'customer_lifetime_value'
]
# Result: High-value, Medium-value, Low-value customer segments
| Aspect | Clustering | Classification |
|---|---|---|
| Learning Type | Unsupervised | Supervised |
| Labels | No labels provided | Labeled training data |
| Objective | Discover hidden groups | Predict class labels |
| Evaluation | Internal measures | External accuracy metrics |
| Applications | Exploratory analysis | Prediction tasks |
# Common approaches:
# - Elbow method
# - Silhouette analysis
# - Gap statistic
# - Domain expertise
1. Silhouette Score - Measures how similar objects are to their own cluster compared to other clusters - Range: [-1, 1], higher is better - Formula:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the lowest mean distance from i to the points of any other cluster.
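A from-scratch check of the formula for a single point (an illustrative sketch; the helper name silhouette_point is hypothetical, and labels is assumed to be a NumPy array):
import numpy as np
def silhouette_point(i, X, labels):
    """Compute s(i) directly from the definition above."""
    own = labels[i]
    d = np.linalg.norm(X - X[i], axis=1)              # distances from point i to all points
    mask_own = (labels == own) & (np.arange(len(X)) != i)
    a = d[mask_own].mean()                            # a(i): mean intra-cluster distance
    b = min(d[labels == c].mean() for c in set(labels) if c != own)  # b(i)
    return (b - a) / max(a, b)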
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
import numpy as np
# Example calculation
X = np.random.rand(100, 2) # Sample data
kmeans = KMeans(n_clusters=3)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")2. Davies-Bouldin Index - Lower values indicate better clustering - Measures average similarity between clusters
3. Calinski-Harabasz Index - Ratio of between-cluster to within-cluster dispersion - Higher values indicate better clustering
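Both indices are available directly in scikit-learn; a minimal sketch reusing X and labels from the silhouette example above:
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score
db = davies_bouldin_score(X, labels)      # lower is better
ch = calinski_harabasz_score(X, labels)   # higher is better
print(f"Davies-Bouldin: {db:.3f}, Calinski-Harabasz: {ch:.1f}")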
1. Adjusted Rand Index (ARI) - Measures similarity between true and predicted clusters - Range: [-1, 1], 1 is perfect matching
2. Normalized Mutual Information (NMI) - Measures shared information between clusterings - Range: [0, 1], 1 is perfect matching
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
# Example with ground truth
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 1, 2, 2]
ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"ARI: {ari:.3f}, NMI: {nmi:.3f}")import numpy as np
def euclidean_distance(x1, x2):
return np.sqrt(np.sum((x1 - x2) ** 2))
# Example
point1 = np.array([1, 2])
point2 = np.array([4, 6])
distance = euclidean_distance(point1, point2)
print(f"Euclidean Distance: {distance:.3f}")def manhattan_distance(x1, x2):
return np.sum(np.abs(x1 - x2))
distance = manhattan_distance(point1, point2)
print(f"Manhattan Distance: {distance:.3f}")from sklearn.metrics.pairwise import cosine_similarity
def cosine_distance(x1, x2):
similarity = cosine_similarity([x1], [x2])[0, 0]
return 1 - similarity
distance = cosine_distance(point1, point2)
print(f"Cosine Distance: {distance:.3f}")from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
# Example dataset
data = pd.DataFrame({
'age': [25, 30, 35, 40],
'income': [30000, 50000, 75000, 90000],
'spending': [500, 1200, 2000, 2500]
})
# Standard Scaling (Z-score normalization)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Min-Max Scaling
minmax_scaler = MinMaxScaler()
data_minmax = minmax_scaler.fit_transform(data)
print("Original Data:")
print(data)
print("\nStandardized Data:")
print(data_scaled)
from sklearn.impute import SimpleImputer, KNNImputer
# Simple imputation
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
# KNN imputation (more sophisticated)
knn_imputer = KNNImputer(n_neighbors=3)
data_knn_imputed = knn_imputer.fit_transform(data)
from sklearn.decomposition import PCA
# Apply PCA before clustering for high-dimensional data
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")
print(f"Total Variance Explained: {sum(pca.explained_variance_ratio_):.3f}")K-Means represents a classical example of coordinate descent optimization applied to clustering. The algorithm alternates between two optimization steps, each reducing the objective function until convergence to a local optimum.
Mathematical Formulation
K-Means solves the following optimization problem:
minimize over C₁,…,Cₖ and μ₁,…,μₖ: J = Σᵢ₌₁ᵏ Σₓ∈Cᵢ ||x - μᵢ||²
Subject to: - Partition constraint: Each point belongs to exactly one cluster - Centroid constraint: μᵢ minimizes within-cluster variance
Coordinate Descent Algorithm (Lloyd’s Algorithm)
The K-Means algorithm alternates between two optimization steps:
Assignment Step (fix centroids, optimize assignments): **Cᵢ = {x : ||x - μᵢ||² ≤ ||x - μⱼ||² ∀j}**
Update Step (fix assignments, optimize centroids): **μᵢ = argmin_μ Σₓ∈Cᵢ ||x - μ||² = (1/|Cᵢ|) Σₓ∈Cᵢ x**
Convergence Properties
Each iteration can only decrease the objective J, and there are finitely many partitions of the data, so Lloyd’s algorithm always terminates - but only at a local optimum that depends on the initialization.
Computational Complexity
Each iteration costs O(n·k·d) for n samples, k clusters, and d dimensions; with I iterations the total cost is O(n·k·d·I), which is why K-Means scales well to large datasets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
class KMeansFromScratch:
def __init__(self, k=3, max_iters=100, random_state=42):
self.k = k
self.max_iters = max_iters
self.random_state = random_state
def initialize_centroids(self, X):
"""Initialize centroids randomly"""
np.random.seed(self.random_state)
n_samples, n_features = X.shape
centroids = np.zeros((self.k, n_features))
for i in range(self.k):
centroids[i] = X[np.random.randint(0, n_samples)]
return centroids
def assign_clusters(self, X, centroids):
"""Assign each point to nearest centroid"""
distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
return np.argmin(distances, axis=0)
def update_centroids(self, X, labels):
"""Update centroids to mean of assigned points"""
centroids = np.zeros((self.k, X.shape[1]))
for i in range(self.k):
if np.sum(labels == i) > 0:
centroids[i] = X[labels == i].mean(axis=0)
return centroids
def calculate_wcss(self, X, labels, centroids):
"""Calculate Within-Cluster Sum of Squares"""
wcss = 0
for i in range(self.k):
cluster_points = X[labels == i]
if len(cluster_points) > 0:
wcss += np.sum((cluster_points - centroids[i]) ** 2)
return wcss
def fit(self, X):
"""Fit K-Means to data"""
# Initialize centroids
self.centroids = self.initialize_centroids(X)
self.wcss_history = []
for iteration in range(self.max_iters):
# Assign clusters
labels = self.assign_clusters(X, self.centroids)
# Calculate WCSS
wcss = self.calculate_wcss(X, labels, self.centroids)
self.wcss_history.append(wcss)
# Update centroids
new_centroids = self.update_centroids(X, labels)
# Check for convergence
if np.allclose(self.centroids, new_centroids):
print(f"Converged after {iteration + 1} iterations")
break
self.centroids = new_centroids
self.labels_ = labels
return self
def predict(self, X):
"""Predict cluster labels for new data"""
return self.assign_clusters(X, self.centroids)
# Example usage
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
center_box=(-5.0, 5.0), random_state=42)
# Apply our K-Means
kmeans = KMeansFromScratch(k=3)
kmeans.fit(X)
# Visualize results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.7)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1],
c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.subplot(1, 2, 2)
plt.plot(kmeans.wcss_history)
plt.title('WCSS Convergence')
plt.xlabel('Iteration')
plt.ylabel('WCSS')
plt.grid(True)
plt.tight_layout()
plt.show()
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Generate sample data
X, true_labels = make_blobs(n_samples=300, centers=4,
cluster_std=0.8, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
predicted_labels = kmeans.fit_predict(X)
# Calculate metrics
silhouette_avg = silhouette_score(X, predicted_labels)
inertia = kmeans.inertia_ # WCSS
print(f"Silhouette Score: {silhouette_avg:.3f}")
print(f"Inertia (WCSS): {inertia:.2f}")
# Visualize results
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='tab10', alpha=0.7)
plt.title('True Clusters')
plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=predicted_labels, cmap='tab10', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means Results')
plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=predicted_labels, cmap='tab10', alpha=0.7)
for i, center in enumerate(kmeans.cluster_centers_):
plt.annotate(f'C{i}', center, fontsize=12, fontweight='bold')
plt.title('Cluster Centers')
plt.tight_layout()
plt.show()
def plot_elbow_method(X, max_k=10):
"""Plot elbow method to find optimal k"""
wcss = []
k_range = range(1, max_k + 1)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss, 'bo-')
plt.title('Elbow Method For Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS (Inertia)')
plt.grid(True)
# Calculate elbow point (simple method)
# Look for the point where improvement starts to slow down
differences = [wcss[i-1] - wcss[i] for i in range(1, len(wcss))]
elbow_k = differences.index(max(differences)) + 2 # +2 because we start from k=1
plt.axvline(x=elbow_k, color='red', linestyle='--',
label=f'Elbow at k={elbow_k}')
plt.legend()
plt.show()
return wcss, elbow_k
# Example usage
wcss_values, optimal_k = plot_elbow_method(X, max_k=8)
print(f"Suggested optimal k: {optimal_k}")from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm
def silhouette_analysis(X, max_k=10):
"""Perform silhouette analysis for different k values"""
silhouette_scores = []
k_range = range(2, max_k + 1)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
# Plot silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_range, silhouette_scores, 'go-')
plt.title('Silhouette Analysis')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid(True)
# Find optimal k
optimal_k = k_range[np.argmax(silhouette_scores)]
plt.axvline(x=optimal_k, color='red', linestyle='--',
label=f'Optimal k={optimal_k}')
plt.legend()
plt.show()
return silhouette_scores, optimal_k
# Example usage
sil_scores, optimal_k_sil = silhouette_analysis(X, max_k=8)
print(f"Optimal k by silhouette: {optimal_k_sil}")def detailed_silhouette_plot(X, k):
"""Create detailed silhouette plot for specific k"""
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
# Calculate silhouette scores
silhouette_avg = silhouette_score(X, labels)
sample_silhouette_values = silhouette_samples(X, labels)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Silhouette plot
y_lower = 10
for i in range(k):
cluster_silhouette_values = sample_silhouette_values[labels == i]
cluster_silhouette_values.sort()
size_cluster_i = cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / k)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10
ax1.set_xlabel('Silhouette Coefficient Values')
ax1.set_ylabel('Cluster Label')
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_title(f'Silhouette Plot (k={k}, avg={silhouette_avg:.3f})')
# Cluster plot
colors = cm.nipy_spectral(labels.astype(float) / k)
ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, c=colors, alpha=0.7)
ax2.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
marker='x', s=300, linewidths=2, color='red')
ax2.set_title(f'Clustering Results (k={k})')
plt.tight_layout()
plt.show()
# Example usage
detailed_silhouette_plot(X, k=4)
# K-Means++ initialization (default in scikit-learn)
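# (Aside: what k-means++ seeding does, as a minimal illustrative sketch.
#  The helper below is hypothetical and scikit-learn's internal version is
#  more refined: pick the first center uniformly at random, then pick each
#  subsequent center with probability proportional to its squared distance
#  from the nearest center chosen so far.)
def kmeanspp_init(X, k, seed=42):
    rng = np.random.RandomState(seed)
    centers = [X[rng.randint(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)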
kmeans_plus = KMeans(n_clusters=3, init='k-means++', random_state=42)
# Compare with random initialization
kmeans_random = KMeans(n_clusters=3, init='random', random_state=42)
# Fit both models
labels_plus = kmeans_plus.fit_predict(X)
labels_random = kmeans_random.fit_predict(X)
print(f"K-Means++ Inertia: {kmeans_plus.inertia_:.2f}")
print(f"Random Init Inertia: {kmeans_random.inertia_:.2f}")from sklearn.cluster import MiniBatchKMeans
import time
# Generate large dataset
X_large, _ = make_blobs(n_samples=10000, centers=5,
cluster_std=1.5, random_state=42)
# Standard K-Means
start_time = time.time()
kmeans_standard = KMeans(n_clusters=5, random_state=42)
labels_standard = kmeans_standard.fit_predict(X_large)
standard_time = time.time() - start_time
# Mini-Batch K-Means
start_time = time.time()
kmeans_mini = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=42)
labels_mini = kmeans_mini.fit_predict(X_large)
mini_time = time.time() - start_time
print(f"Standard K-Means Time: {standard_time:.3f}s")
print(f"Mini-Batch K-Means Time: {mini_time:.3f}s")
print(f"Speedup: {standard_time/mini_time:.1f}x")
# Compare results
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(labels_standard, labels_mini)
print(f"ARI between methods: {ari:.3f}")✅ Simple and Fast: Easy to implement and
computationally efficient
✅ Scalable: Works well with large datasets
✅ Guaranteed Convergence: Always converges to local
optimum
✅ Well-Understood: Extensive theoretical
foundation
✅ Requires Pre-specified k: Need to know number of
clusters
✅ Sensitive to Initialization: Different runs may give
different results
✅ Assumes Spherical Clusters: Poor performance with
non-spherical shapes
✅ Sensitive to Outliers: Outliers can significantly
affect centroids
✅ Struggles with Varying Densities: All clusters
assumed to have similar sizes
# Always scale features for K-Means
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare results with/without scaling
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Without scaling
kmeans_unscaled = KMeans(n_clusters=3, random_state=42)
labels_unscaled = kmeans_unscaled.fit_predict(X)
# With scaling
kmeans_scaled = KMeans(n_clusters=3, random_state=42)
labels_scaled = kmeans_scaled.fit_predict(X_scaled)
axes[0].scatter(X[:, 0], X[:, 1], c=labels_unscaled, alpha=0.7)
axes[0].set_title('Without Scaling')
axes[1].scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels_scaled, alpha=0.7)
axes[1].set_title('With Scaling')
plt.tight_layout()
plt.show()
from sklearn.preprocessing import RobustScaler
# Use RobustScaler for data with outliers
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)
# Or remove outliers using IQR method
def remove_outliers_iqr(data, factor=1.5):
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - factor * IQR
upper_bound = Q3 + factor * IQR
mask = (data >= lower_bound) & (data <= upper_bound)
return mask
# Apply to each feature
mask_combined = np.ones(len(X), dtype=bool)
for i in range(X.shape[1]):
mask_combined &= remove_outliers_iqr(X[:, i])
X_clean = X[mask_combined]
print(f"Removed {len(X) - len(X_clean)} outliers")def stable_kmeans(X, k, n_runs=10):
"""Run K-Means multiple times and return best result"""
best_inertia = float('inf')
best_labels = None
best_centers = None
for run in range(n_runs):
kmeans = KMeans(n_clusters=k, random_state=run, n_init=1)
labels = kmeans.fit_predict(X)
if kmeans.inertia_ < best_inertia:
best_inertia = kmeans.inertia_
best_labels = labels
best_centers = kmeans.cluster_centers_
return best_labels, best_centers, best_inertia
# Example usage
labels, centers, inertia = stable_kmeans(X, k=3, n_runs=20)
print(f"Best inertia after 20 runs: {inertia:.2f}")Hierarchical clustering constructs a hierarchy of nested partitions, represented as a binary tree (dendrogram) that encodes the entire clustering process. This approach provides richer information than flat clustering by revealing data structure at multiple scales.
Mathematical Framework: Ultrametric Spaces
Hierarchical clustering produces an ultrametric space where the distance function d satisfies the strong triangle inequality:
d(x,z) ≤ max{d(x,y), d(y,z)} for all x, y, z
This property ensures that the dendrogram accurately represents cluster relationships.
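A small sketch (on assumed random data) checking this property on the cophenetic distances extracted from a dendrogram:
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform
X_check = np.random.RandomState(0).rand(10, 2)
Z_check = linkage(X_check, method='average')
coph = squareform(cophenet(Z_check))     # n x n cophenetic (ultrametric) distance matrix
holds = all(coph[i, k] <= max(coph[i, j], coph[j, k]) + 1e-12
            for i in range(10) for j in range(10) for k in range(10))
print(f"Strong triangle inequality holds: {holds}")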
Algorithmic Approaches
There are two broad strategies: agglomerative (bottom-up) clustering starts with each point as its own cluster and repeatedly merges the closest pair, while divisive (top-down) clustering starts with one all-inclusive cluster and recursively splits it. Within the agglomerative family, the choice of linkage criterion fundamentally determines the clustering behavior and geometric properties of the resulting hierarchy.
Mathematical Formulations
For clusters Cᵢ and Cⱼ:
1. Single Linkage (Minimum): d_min(Cᵢ, Cⱼ) = min{d(x,y) : x ∈ Cᵢ, y ∈ Cⱼ}
2. Complete Linkage (Maximum): d_max(Cᵢ, Cⱼ) = max{d(x,y) : x ∈ Cᵢ, y ∈ Cⱼ}
3. Average Linkage (UPGMA): d_avg(Cᵢ, Cⱼ) = (1/(|Cᵢ||Cⱼ|)) Σₓ∈Cᵢ Σᵧ∈Cⱼ d(x,y)
4. Ward Linkage (Minimum Variance): d_ward(Cᵢ, Cⱼ) = √(2|Cᵢ||Cⱼ|/(|Cᵢ|+|Cⱼ|)) ||μᵢ - μⱼ||²
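A minimal sketch (two tiny clusters assumed here for illustration) computing the first three linkage distances directly from these definitions; scipy’s cdist gives the full pairwise distance matrix between the clusters:
import numpy as np
from scipy.spatial.distance import cdist
C_i = np.array([[0.0, 0.0], [0.0, 1.0]])
C_j = np.array([[3.0, 0.0], [4.0, 0.0]])
D = cdist(C_i, C_j)                       # all pairwise distances between the two clusters
print(f"single (min):    {D.min():.3f}")
print(f"complete (max):  {D.max():.3f}")
print(f"average (UPGMA): {D.mean():.3f}")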
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import pdist, squareform
class AgglomerativeClusteringFromScratch:  # renamed to avoid shadowing sklearn's AgglomerativeClustering below
def __init__(self, linkage='single'):
self.linkage = linkage
self.linkage_functions = {
'single': self._single_linkage,
'complete': self._complete_linkage,
'average': self._average_linkage
}
def _single_linkage(self, cluster1, cluster2, distance_matrix):
"""Minimum distance between clusters"""
distances = []
for i in cluster1:
for j in cluster2:
distances.append(distance_matrix[i, j])
return min(distances)
def _complete_linkage(self, cluster1, cluster2, distance_matrix):
"""Maximum distance between clusters"""
distances = []
for i in cluster1:
for j in cluster2:
distances.append(distance_matrix[i, j])
return max(distances)
def _average_linkage(self, cluster1, cluster2, distance_matrix):
"""Average distance between clusters"""
distances = []
for i in cluster1:
for j in cluster2:
distances.append(distance_matrix[i, j])
return np.mean(distances)
def fit(self, X):
"""Fit agglomerative clustering"""
n_samples = X.shape[0]
# Calculate pairwise distances
distances = pdist(X, metric='euclidean')
self.distance_matrix = squareform(distances)
# Initialize clusters (each point is its own cluster)
clusters = [[i] for i in range(n_samples)]
self.merge_history = []
linkage_func = self.linkage_functions[self.linkage]
# Merge clusters until only one remains
while len(clusters) > 1:
min_distance = float('inf')
merge_indices = (-1, -1)
# Find closest pair of clusters
for i in range(len(clusters)):
for j in range(i + 1, len(clusters)):
distance = linkage_func(clusters[i], clusters[j],
self.distance_matrix)
if distance < min_distance:
min_distance = distance
merge_indices = (i, j)
# Merge clusters
i, j = merge_indices
new_cluster = clusters[i] + clusters[j]
# Record merge
self.merge_history.append({
'clusters': (clusters[i].copy(), clusters[j].copy()),
'distance': min_distance,
'size': len(new_cluster)
})
# Remove old clusters and add new one
clusters = [clusters[k] for k in range(len(clusters))
if k not in [i, j]] + [new_cluster]
return self
def get_clusters(self, n_clusters):
    """Get a flat clustering by replaying the recorded merge history"""
    n_samples = len(self.merge_history) + 1
    if n_clusters >= n_samples:
        return [[i] for i in range(n_samples)]
    # Start from singletons and apply merges until n_clusters remain
    clusters = [[i] for i in range(n_samples)]
    for merge in self.merge_history[:n_samples - n_clusters]:
        c1, c2 = merge['clusters']
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return clusters
# Example usage with simple data
np.random.seed(42)
X_simple = np.random.rand(8, 2) * 10
# Fit agglomerative clustering
agg_clustering = AgglomerativeClusteringFromScratch(linkage='single')
agg_clustering.fit(X_simple)
print("Merge History:")
for i, merge in enumerate(agg_clustering.merge_history):
print(f"Step {i+1}: Merge clusters at distance {merge['distance']:.3f}")from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, true_labels = make_blobs(n_samples=150, centers=3, cluster_std=1.0,
center_box=(-5, 5), random_state=42)
# Different linkage criteria
linkage_methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for i, linkage in enumerate(linkage_methods):
# Fit agglomerative clustering
agg_cluster = AgglomerativeClustering(n_clusters=3, linkage=linkage)
labels = agg_cluster.fit_predict(X)
# Plot results
axes[i].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
axes[i].set_title(f'{linkage.capitalize()} Linkage')
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
def plot_dendrogram(X, method='ward', max_d=None):
"""Create and plot dendrogram"""
# Calculate linkage matrix
Z = linkage(X, method=method)
plt.figure(figsize=(12, 8))
# Create dendrogram
dend = dendrogram(Z,
orientation='top',
distance_sort='descending',
show_leaf_counts=True,
leaf_font_size=10)
if max_d:
plt.axhline(y=max_d, c='red', linestyle='--',
label=f'Cut at distance {max_d}')
plt.legend()
plt.title(f'Dendrogram ({method} linkage)')
plt.xlabel('Sample Index or (Cluster Size)')
plt.ylabel('Distance')
plt.grid(True, alpha=0.3)
plt.show()
return Z
# Generate sample data
np.random.seed(42)
X_sample = np.random.rand(20, 2) * 10
# Plot dendrogram
Z = plot_dendrogram(X_sample, method='ward')
# Find optimal number of clusters from dendrogram
def find_optimal_clusters_dendrogram(Z, max_clusters=10):
"""Find optimal clusters by analyzing dendrogram gaps"""
distances = Z[:, 2] # Extract distances
# Calculate gaps between consecutive merges
gaps = np.diff(distances)
# Find largest gap (elbow point): if the gap follows merge index m,
# then n_samples - (m + 1) = len(distances) - m clusters remain at the cut
m = int(np.argmax(gaps))
optimal_clusters = len(distances) - m
print(f"Suggested optimal clusters: {optimal_clusters}")
# Plot distance vs number of clusters
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(distances) + 1), distances, 'bo-')
plt.axvline(x=optimal_clusters, color='red', linestyle='--',
label=f'Optimal k={optimal_clusters}')
plt.title('Distance vs Number of Merges')
plt.xlabel('Merge Step')
plt.ylabel('Distance')
plt.legend()
plt.grid(True)
plt.show()
return optimal_clusters
optimal_k = find_optimal_clusters_dendrogram(Z)
def interactive_dendrogram_analysis(X, method='ward'):
"""Interactive analysis of dendrogram cuts"""
Z = linkage(X, method=method)
# Different cut heights
cut_heights = np.percentile(Z[:, 2], [70, 80, 90, 95])
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()
for i, height in enumerate(cut_heights):
# Get cluster labels at this height
from scipy.cluster.hierarchy import fcluster
labels = fcluster(Z, height, criterion='distance')
n_clusters = len(np.unique(labels))
# Plot clustering result
axes[i].scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', alpha=0.7)
axes[i].set_title(f'Cut at height {height:.2f} ({n_clusters} clusters)')
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return Z, cut_heights
# Example usage
Z, heights = interactive_dendrogram_analysis(X)
def compare_linkage_methods(X, n_clusters=3):
"""Compare different linkage methods"""
methods = ['single', 'complete', 'average', 'ward']
fig, axes = plt.subplots(2, 4, figsize=(20, 10))  # row 0: clusterings, row 1: dendrograms
for i, method in enumerate(methods):
# Hierarchical clustering
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters,
linkage=method)
labels = agg_cluster.fit_predict(X)
# Dendrogram
Z = linkage(X, method=method)
# Plot clustering results
axes[0, i].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
axes[0, i].set_title(f'{method.capitalize()} Linkage\n({n_clusters} clusters)')
axes[0, i].grid(True, alpha=0.3)
# Plot dendrogram
dendrogram(Z, ax=axes[1, i], orientation='top',
distance_sort='descending', show_leaf_counts=False)
axes[1, i].set_title(f'{method.capitalize()} Dendrogram')
plt.tight_layout()
plt.show()
# Generate data with different cluster shapes
from sklearn.datasets import make_circles, make_moons
# Different datasets
datasets = [
make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42),
make_circles(n_samples=300, noise=0.1, factor=0.6, random_state=42),
make_moons(n_samples=300, noise=0.1, random_state=42)
]
dataset_names = ['Blobs', 'Circles', 'Moons']
for i, (X_test, _) in enumerate(datasets):
print(f"\n=== {dataset_names[i]} Dataset ===")
compare_linkage_methods(X_test, n_clusters=3)
from sklearn.neighbors import kneighbors_graph
def connectivity_constrained_clustering(X, n_neighbors=3, n_clusters=3):
"""Hierarchical clustering with connectivity constraints"""
# Create connectivity graph
connectivity = kneighbors_graph(X, n_neighbors=n_neighbors,
include_self=False)
# Apply constrained clustering
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters,
connectivity=connectivity,
linkage='ward')
labels_constrained = agg_cluster.fit_predict(X)
# Compare with unconstrained
agg_unconstrained = AgglomerativeClustering(n_clusters=n_clusters,
linkage='ward')
labels_unconstrained = agg_unconstrained.fit_predict(X)
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=labels_unconstrained, alpha=0.7)
axes[0].set_title('Unconstrained Clustering')
axes[1].scatter(X[:, 0], X[:, 1], c=labels_constrained, alpha=0.7)
axes[1].set_title(f'Connectivity-Constrained\n(k={n_neighbors} neighbors)')
plt.tight_layout()
plt.show()
return labels_constrained, labels_unconstrained
# Example usage
labels_conn, labels_unconn = connectivity_constrained_clustering(X)
from sklearn.cluster import FeatureAgglomeration
def feature_agglomeration_example(n_clusters=50):
"""Demonstrate feature agglomeration for dimensionality reduction"""
# Create high-dimensional data
np.random.seed(42)
X_high_dim = np.random.rand(100, 200) # 100 samples, 200 features
# Apply feature agglomeration
feature_agg = FeatureAgglomeration(n_clusters=n_clusters)
X_reduced = feature_agg.fit_transform(X_high_dim)
print(f"Original shape: {X_high_dim.shape}")
print(f"Reduced shape: {X_reduced.shape}")
# Visualize feature clustering
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.imshow(X_high_dim[:20, :50], cmap='viridis', aspect='auto')
plt.title('Original Features (first 50)')
plt.xlabel('Features')
plt.ylabel('Samples')
plt.subplot(1, 2, 2)
plt.imshow(X_reduced[:20], cmap='viridis', aspect='auto')
plt.title(f'Agglomerated Features ({n_clusters})')
plt.xlabel('Feature Clusters')
plt.ylabel('Samples')
plt.tight_layout()
plt.show()
return X_reduced, feature_agg
# Example usage
X_reduced, feature_agg = feature_agglomeration_example()
import time
from sklearn.metrics import silhouette_score, adjusted_rand_score
def compare_hierarchical_kmeans(X, n_clusters=3, n_runs=5):
"""Comprehensive comparison between hierarchical and K-Means"""
results = {
'method': [],
'time': [],
'silhouette': [],
'inertia': [],
'labels': []
}
# K-Means comparison
for run in range(n_runs):
# K-Means
start_time = time.time()
kmeans = KMeans(n_clusters=n_clusters, random_state=run, n_init=10)
kmeans_labels = kmeans.fit_predict(X)
kmeans_time = time.time() - start_time
results['method'].append('K-Means')
results['time'].append(kmeans_time)
results['silhouette'].append(silhouette_score(X, kmeans_labels))
results['inertia'].append(kmeans.inertia_)
results['labels'].append(kmeans_labels)
# Hierarchical (Ward)
start_time = time.time()
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters,
linkage='ward')
hier_labels = agg_cluster.fit_predict(X)
hier_time = time.time() - start_time
results['method'].append('Hierarchical')
results['time'].append(hier_time)
results['silhouette'].append(silhouette_score(X, hier_labels))
results['inertia'].append(None) # No inertia for hierarchical
results['labels'].append(hier_labels)
# Create comparison dataframe
import pandas as pd
df_results = pd.DataFrame(results)
# Summary statistics
print("=== Performance Comparison ===")
summary = df_results.groupby('method').agg({
'time': ['mean', 'std'],
'silhouette': ['mean', 'std']
}).round(4)
print(summary)
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Time comparison
df_results.boxplot(column='time', by='method', ax=axes[0])
axes[0].set_title('Execution Time Comparison')
axes[0].set_ylabel('Time (seconds)')
# Silhouette comparison
df_results.boxplot(column='silhouette', by='method', ax=axes[1])
axes[1].set_title('Silhouette Score Comparison')
axes[1].set_ylabel('Silhouette Score')
# Stability comparison (agreement between runs)
kmeans_labels_list = [labels for i, labels in enumerate(results['labels'])
if results['method'][i] == 'K-Means']
hier_labels_list = [labels for i, labels in enumerate(results['labels'])
if results['method'][i] == 'Hierarchical']
# Calculate pairwise ARI for stability
def calculate_stability(labels_list):
aris = []
for i in range(len(labels_list)):
for j in range(i + 1, len(labels_list)):
ari = adjusted_rand_score(labels_list[i], labels_list[j])
aris.append(ari)
return np.mean(aris) if aris else 0
kmeans_stability = calculate_stability(kmeans_labels_list)
hier_stability = calculate_stability(hier_labels_list)
stabilities = [kmeans_stability, hier_stability]
methods = ['K-Means', 'Hierarchical']
axes[2].bar(methods, stabilities, color=['skyblue', 'lightcoral'])
axes[2].set_title('Algorithm Stability (ARI between runs)')
axes[2].set_ylabel('Average ARI')
axes[2].set_ylim(0, 1)
plt.tight_layout()
plt.show()
return df_results
# Example comparison
comparison_results = compare_hierarchical_kmeans(X, n_clusters=3)
Use hierarchical clustering for:
✅ Small to Medium Datasets (< 10,000 samples)
✅ Unknown Number of Clusters (the dendrogram helps decide)
✅ Nested Cluster Structure (hierarchical relationships important)
✅ Deterministic Results (no random initialization)
✅ Non-spherical Clusters (single/complete linkage)
Avoid hierarchical clustering when:
❌ Large Datasets (O(n³) complexity)
❌ Speed is Critical (slower than K-Means)
❌ Spherical Clusters with Known k (K-Means is better)
❌ Noisy Data with Outliers (single linkage suffers)
def clustering_algorithm_selector(X, requirements):
"""Helper function to select appropriate clustering algorithm"""
n_samples = X.shape[0]
recommendations = []
if requirements.get('unknown_k', False):
recommendations.append("Hierarchical (use dendrogram to find k)")
if requirements.get('deterministic', False):
recommendations.append("Hierarchical (no random initialization)")
if requirements.get('large_dataset', False) or n_samples > 10000:
recommendations.append("K-Means or Mini-Batch K-Means")
if requirements.get('non_spherical', False):
recommendations.append("Hierarchical (single/complete linkage) or DBSCAN")
if requirements.get('speed_critical', False):
recommendations.append("K-Means")
if requirements.get('interpretable_hierarchy', False):
recommendations.append("Hierarchical (dendrogram provides insights)")
return recommendations
# Example usage
requirements = {
'unknown_k': True,
'deterministic': True,
'large_dataset': False,
'non_spherical': False,
'speed_critical': False,
'interpretable_hierarchy': True
}
recommendations = clustering_algorithm_selector(X, requirements)
print("Recommended algorithms:")
for rec in recommendations:
print(f"- {rec}")DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can find arbitrarily shaped clusters and automatically detect outliers.
DBSCAN classifies each point into one of three categories:
- Core point: has at least min_samples neighbors within eps distance
- Border point: lies within eps distance of a core point but has fewer than min_samples neighbors of its own
- Noise point: neither a core point nor a border point
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from collections import deque
class DBSCANFromScratch:
def __init__(self, eps=0.5, min_samples=5):
self.eps = eps
self.min_samples = min_samples
def _get_neighbors(self, point_idx, X):
"""Get neighbors within eps distance"""
distances = np.sqrt(np.sum((X - X[point_idx]) ** 2, axis=1))
return np.where(distances <= self.eps)[0]
def fit_predict(self, X):
"""Fit DBSCAN and return cluster labels"""
n_samples = X.shape[0]
labels = np.full(n_samples, -1) # Initialize all as noise (-1)
visited = np.zeros(n_samples, dtype=bool)
cluster_id = 0
for i in range(n_samples):
if visited[i]:
continue
visited[i] = True
neighbors = self._get_neighbors(i, X)
# Check if core point
if len(neighbors) < self.min_samples:
continue # Noise point, keep label as -1
# Start new cluster
labels[i] = cluster_id
# Expand cluster using queue (breadth-first search)
seed_set = deque(neighbors)
while seed_set:
current_point = seed_set.popleft()
if not visited[current_point]:
visited[current_point] = True
current_neighbors = self._get_neighbors(current_point, X)
# If current point is also core point, add its neighbors
if len(current_neighbors) >= self.min_samples:
seed_set.extend(current_neighbors)
# Assign to cluster if not already assigned
if labels[current_point] == -1:
labels[current_point] = cluster_id
cluster_id += 1
return labels
# Example usage
np.random.seed(42)
# Create sample data with different densities and noise
from sklearn.datasets import make_blobs
centers = [[2, 2], [-2, -2], [2, -2]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
random_state=42)
# Add some noise points
noise = np.random.uniform(-4, 4, (20, 2))
X_with_noise = np.vstack([X, noise])
# Apply custom DBSCAN
dbscan_custom = DBSCANFromScratch(eps=0.8, min_samples=5)
labels_custom = dbscan_custom.fit_predict(X_with_noise)
# Visualize results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_with_noise[:, 0], X_with_noise[:, 1], c='gray', alpha=0.6)
plt.title('Original Data with Noise')
plt.subplot(1, 2, 2)
# Plot clusters and noise
unique_labels = set(labels_custom)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
if k == -1:
# Noise points in black
class_member_mask = (labels_custom == k)
xy = X_with_noise[class_member_mask]
plt.scatter(xy[:, 0], xy[:, 1], c='black', marker='x', s=50, alpha=0.7)
else:
class_member_mask = (labels_custom == k)
xy = X_with_noise[class_member_mask]
plt.scatter(xy[:, 0], xy[:, 1], c=[col], s=50, alpha=0.7)
plt.title(f'DBSCAN Clustering (eps={dbscan_custom.eps}, min_samples={dbscan_custom.min_samples})')
plt.tight_layout()
plt.show()
print(f"Number of clusters: {len(set(labels_custom)) - (1 if -1 in labels_custom else 0)}")
print(f"Number of noise points: {sum(labels_custom == -1)}")from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_circles, make_moons
from sklearn.metrics import silhouette_score
def dbscan_analysis(X, eps_range=None, min_samples_range=None):
"""Comprehensive DBSCAN analysis with parameter tuning"""
if eps_range is None:
eps_range = np.arange(0.1, 1.0, 0.1)
if min_samples_range is None:
min_samples_range = range(3, 11)
# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Parameter tuning
results = []
for eps in eps_range:
for min_samples in min_samples_range:
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = sum(labels == -1)
if n_clusters > 1 and n_clusters < len(X) - 1: # Valid clustering
silhouette_avg = silhouette_score(X_scaled, labels)
else:
silhouette_avg = -1
results.append({
'eps': eps,
'min_samples': min_samples,
'n_clusters': n_clusters,
'n_noise': n_noise,
'silhouette': silhouette_avg
})
# Convert to DataFrame for analysis
import pandas as pd
results_df = pd.DataFrame(results)
# Find best parameters
valid_results = results_df[results_df['silhouette'] > 0]
if not valid_results.empty:
best_result = valid_results.loc[valid_results['silhouette'].idxmax()]
print(f"Best parameters: eps={best_result['eps']:.2f}, min_samples={best_result['min_samples']}")
print(f"Best silhouette: {best_result['silhouette']:.3f}")
# Apply best DBSCAN
best_dbscan = DBSCAN(eps=best_result['eps'],
min_samples=best_result['min_samples'])
best_labels = best_dbscan.fit_predict(X_scaled)
return best_labels, best_result
else:
print("No valid clustering found with given parameter ranges")
return None, None
# Test on different datasets
datasets = [
('Blobs', make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)),
('Circles', make_circles(n_samples=300, noise=0.05, factor=0.6, random_state=42)),
('Moons', make_moons(n_samples=300, noise=0.1, random_state=42))
]
fig, axes = plt.subplots(len(datasets), 2, figsize=(12, 4 * len(datasets)))
for i, (name, (X_test, _)) in enumerate(datasets):
# Original data
axes[i, 0].scatter(X_test[:, 0], X_test[:, 1], alpha=0.7)
axes[i, 0].set_title(f'{name} - Original Data')
# DBSCAN results
labels, best_params = dbscan_analysis(X_test)
if labels is not None:
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
if k == -1:
class_member_mask = (labels == k)
xy = X_test[class_member_mask]
axes[i, 1].scatter(xy[:, 0], xy[:, 1], c='black',
marker='x', s=50, alpha=0.7, label='Noise')
else:
class_member_mask = (labels == k)
xy = X_test[class_member_mask]
axes[i, 1].scatter(xy[:, 0], xy[:, 1], c=[col], s=50, alpha=0.7,
label=f'Cluster {k}')
axes[i, 1].set_title(f'{name} - DBSCAN Results')
plt.tight_layout()
plt.show()
1. K-Distance Plot for Eps Selection
def plot_k_distance(X, k=4):
"""Plot k-distance graph to help choose eps parameter"""
from sklearn.neighbors import NearestNeighbors
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find k-nearest neighbors
nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, indices = nbrs.kneighbors(X_scaled)
# Sort distances to k-th nearest neighbor
distances = np.sort(distances[:, k-1], axis=0)[::-1]
plt.figure(figsize=(10, 6))
plt.plot(range(len(distances)), distances)
plt.title(f'{k}-Distance Plot for Eps Selection')
plt.xlabel('Points sorted by distance')
plt.ylabel(f'Distance to {k}th nearest neighbor')
plt.grid(True)
# Find elbow point (knee point)
# Calculate second derivative to find inflection point
second_derivative = np.gradient(np.gradient(distances))
knee_point = np.argmax(second_derivative)
suggested_eps = distances[knee_point]
plt.axhline(y=suggested_eps, color='red', linestyle='--',
label=f'Suggested eps: {suggested_eps:.3f}')
plt.legend()
plt.show()
return suggested_eps
# Example usage
suggested_eps = plot_k_distance(X_with_noise, k=4)
print(f"Suggested eps value: {suggested_eps:.3f}")Gaussian Mixture Models assume that data comes from a mixture of Gaussian distributions. Unlike K-Means (hard clustering), GMM provides soft clustering with probability assignments.
A GMM assumes data is generated by a mixture of K Gaussian components:
p(x) = Σ(k=1 to K) πk * N(x | μk, Σk)
Where:
- πk = mixing coefficient (weight) of component k, with Σ(k=1 to K) πk = 1
- N(x | μk, Σk) = Gaussian density with mean μk and covariance matrix Σk
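To make the formula concrete, here is a minimal sketch that evaluates the mixture density p(x) directly; the two-component weights, means, and covariances are made-up illustrative values, not parameters from any fitted model.
import numpy as np
from scipy.stats import multivariate_normal
# Hypothetical 2-component mixture in 2D (illustrative parameters only)
weights = np.array([0.6, 0.4])  # mixing coefficients πk, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
def mixture_density(x):
    """Evaluate p(x) = Σk πk * N(x | μk, Σk)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))
print(f"p([1, 1]) = {mixture_density(np.array([1.0, 1.0])):.4f}")
The same quantities drive soft clustering: the responsibility of component k for a point x is πk * N(x | μk, Σk) / p(x), which is what scikit-learn's GaussianMixture.predict_proba returns for a fitted model.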
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
def plot_gmm_results(X, gmm, title="GMM Clustering"):
"""Plot GMM clustering results with confidence ellipses"""
# Predict cluster labels and probabilities
labels = gmm.predict(X)
probs = gmm.predict_proba(X)
plt.figure(figsize=(12, 5))
# Plot hard clustering (most likely cluster)
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.title(f'{title} - Hard Clustering')
# Draw confidence ellipses
for i in range(gmm.n_components):
mean = gmm.means_[i]
cov = gmm.covariances_[i]
# Calculate ellipse parameters
eigenvals, eigenvecs = np.linalg.eigh(cov)
angle = np.degrees(np.arctan2(eigenvecs[1, 0], eigenvecs[0, 0]))
width, height = 2 * np.sqrt(eigenvals)
ellipse = Ellipse(mean, width, height, angle=angle,
alpha=0.3, facecolor=plt.cm.viridis(i / gmm.n_components))
plt.gca().add_patch(ellipse)
plt.scatter(mean[0], mean[1], c='red', marker='x', s=100, linewidths=3)
# Plot soft clustering (probability-weighted)
plt.subplot(1, 2, 2)
# Color points based on maximum probability
max_probs = np.max(probs, axis=1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis',
s=50 * max_probs, alpha=0.7)
plt.title(f'{title} - Soft Clustering\n(Size = Confidence)')
plt.tight_layout()
plt.show()
# Generate sample data
np.random.seed(42)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
plot_gmm_results(X, gmm, "Gaussian Mixture Model")
# Print model parameters
print("GMM Parameters:")
print(f"Mixing coefficients (weights): {gmm.weights_}")
print(f"Means:\n{gmm.means_}")
print(f"Covariances:\n{gmm.covariances_}")from sklearn.model_selection import cross_val_score
def gmm_model_selection(X, max_components=10, cv_folds=5):
"""Select optimal number of components using various criteria"""
n_components_range = range(1, max_components + 1)
# Storage for different criteria
bic_scores = []
aic_scores = []
log_likelihoods = []
cv_scores = []
for n_components in n_components_range:
# Fit GMM
gmm = GaussianMixture(n_components=n_components, random_state=42)
gmm.fit(X)
# Calculate information criteria
bic_scores.append(gmm.bic(X))
aic_scores.append(gmm.aic(X))
log_likelihoods.append(gmm.score(X))
# Cross-validation score
cv_score = cross_val_score(gmm, X, cv=cv_folds).mean()
cv_scores.append(cv_score)
# Plot results
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(n_components_range, bic_scores, 'bo-')
axes[0, 0].set_title('BIC Score (Lower is Better)')
axes[0, 0].set_xlabel('Number of Components')
axes[0, 0].grid(True)
axes[0, 1].plot(n_components_range, aic_scores, 'ro-')
axes[0, 1].set_title('AIC Score (Lower is Better)')
axes[0, 1].set_xlabel('Number of Components')
axes[0, 1].grid(True)
axes[1, 0].plot(n_components_range, log_likelihoods, 'go-')
axes[1, 0].set_title('Log-Likelihood (Higher is Better)')
axes[1, 0].set_xlabel('Number of Components')
axes[1, 0].grid(True)
axes[1, 1].plot(n_components_range, cv_scores, 'mo-')
axes[1, 1].set_title('Cross-Validation Score (Higher is Better)')
axes[1, 1].set_xlabel('Number of Components')
axes[1, 1].grid(True)
plt.tight_layout()
plt.show()
# Find optimal number of components
optimal_bic = n_components_range[np.argmin(bic_scores)]
optimal_aic = n_components_range[np.argmin(aic_scores)]
optimal_cv = n_components_range[np.argmax(cv_scores)]
print(f"Optimal components by BIC: {optimal_bic}")
print(f"Optimal components by AIC: {optimal_aic}")
print(f"Optimal components by CV: {optimal_cv}")
return {
'bic_scores': bic_scores,
'aic_scores': aic_scores,
'optimal_bic': optimal_bic,
'optimal_aic': optimal_aic,
'optimal_cv': optimal_cv
}
# Example usage
selection_results = gmm_model_selection(X, max_components=8)
def compare_gmm_covariance_types(X, n_components=3):
"""Compare different covariance types for GMM"""
covariance_types = ['full', 'tied', 'diag', 'spherical']
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for i, cov_type in enumerate(covariance_types):
# Fit GMM with specific covariance type
gmm = GaussianMixture(n_components=n_components,
covariance_type=cov_type,
random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
# Plot results
axes[i].scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
# Draw confidence ellipses (for full and tied covariance)
if cov_type in ['full', 'tied']:
for j in range(gmm.n_components):
mean = gmm.means_[j]
if cov_type == 'full':
cov = gmm.covariances_[j]
else: # tied
cov = gmm.covariances_
eigenvals, eigenvecs = np.linalg.eigh(cov)
angle = np.degrees(np.arctan2(eigenvecs[1, 0], eigenvecs[0, 0]))
width, height = 2 * np.sqrt(eigenvals)
ellipse = Ellipse(mean, width, height, angle=angle,
alpha=0.3, facecolor=plt.cm.viridis(j / gmm.n_components))
axes[i].add_patch(ellipse)
# Mark cluster centers
axes[i].scatter(gmm.means_[:, 0], gmm.means_[:, 1],
c='red', marker='x', s=100, linewidths=3)
axes[i].set_title(f'{cov_type.capitalize()} Covariance\nBIC: {gmm.bic(X):.1f}')
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Example usage
compare_gmm_covariance_types(X, n_components=3)
from sklearn.metrics import (silhouette_score, silhouette_samples,
calinski_harabasz_score, davies_bouldin_score,
adjusted_rand_score, normalized_mutual_info_score,
homogeneity_completeness_v_measure)
def comprehensive_clustering_evaluation(X, labels, true_labels=None):
"""Comprehensive evaluation of clustering results"""
# Remove noise points for internal metrics (if any)
mask = labels != -1
X_clean = X[mask] if np.any(mask) else X
labels_clean = labels[mask] if np.any(mask) else labels
results = {}
# Internal metrics (no ground truth needed)
if len(np.unique(labels_clean)) > 1:
results['silhouette'] = silhouette_score(X_clean, labels_clean)
results['calinski_harabasz'] = calinski_harabasz_score(X_clean, labels_clean)
results['davies_bouldin'] = davies_bouldin_score(X_clean, labels_clean)
else:
results['silhouette'] = -1
results['calinski_harabasz'] = 0
results['davies_bouldin'] = float('inf')
# Cluster statistics
unique_labels, counts = np.unique(labels, return_counts=True)
results['n_clusters'] = len(unique_labels) - (1 if -1 in unique_labels else 0)
results['n_noise'] = counts[unique_labels == -1][0] if -1 in unique_labels else 0
results['cluster_sizes'] = counts[unique_labels != -1] if -1 in unique_labels else counts
# External metrics (if ground truth available)
if true_labels is not None:
if len(np.unique(labels_clean)) > 1 and len(np.unique(true_labels[mask])) > 1:
results['adjusted_rand_score'] = adjusted_rand_score(true_labels[mask], labels_clean)
results['normalized_mutual_info'] = normalized_mutual_info_score(true_labels[mask], labels_clean)
# Homogeneity, Completeness, V-measure
h, c, v = homogeneity_completeness_v_measure(true_labels[mask], labels_clean)
results['homogeneity'] = h
results['completeness'] = c
results['v_measure'] = v
else:
results['adjusted_rand_score'] = 0
results['normalized_mutual_info'] = 0
results['homogeneity'] = 0
results['completeness'] = 0
results['v_measure'] = 0
return results
def visualize_clustering_evaluation(X, algorithms_results, true_labels=None):
"""Visualize clustering results and evaluation metrics"""
n_algorithms = len(algorithms_results)
# Create subplots
if true_labels is not None:
fig, axes = plt.subplots(2, n_algorithms + 1, figsize=(4 * (n_algorithms + 1), 8))
# Plot ground truth
axes[0, 0].scatter(X[:, 0], X[:, 1], c=true_labels, cmap='tab10', alpha=0.7)
axes[0, 0].set_title('Ground Truth')
axes[1, 0].axis('off') # Empty space
start_col = 1
else:
fig, axes = plt.subplots(2, n_algorithms, figsize=(4 * n_algorithms, 8))
start_col = 0
evaluation_results = {}
for i, (name, labels) in enumerate(algorithms_results.items()):
col = start_col + i
# Plot clustering results
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col_val in zip(unique_labels, colors):
if k == -1:
# Noise points
class_member_mask = (labels == k)
xy = X[class_member_mask]
axes[0, col].scatter(xy[:, 0], xy[:, 1], c='black',
marker='x', s=50, alpha=0.7, label='Noise')
else:
class_member_mask = (labels == k)
xy = X[class_member_mask]
axes[0, col].scatter(xy[:, 0], xy[:, 1], c=[col_val], s=50, alpha=0.7,
label=f'Cluster {k}')
axes[0, col].set_title(f'{name}')
# Evaluate clustering
eval_results = comprehensive_clustering_evaluation(X, labels, true_labels)
evaluation_results[name] = eval_results
# Plot evaluation metrics
metrics = ['silhouette', 'calinski_harabasz', 'davies_bouldin']
if true_labels is not None:
metrics.extend(['adjusted_rand_score', 'normalized_mutual_info'])
metric_names = []
metric_values = []
for metric in metrics:
if metric in eval_results:
metric_names.append(metric.replace('_', ' ').title())
metric_values.append(eval_results[metric])
# Bar plot of metrics
bars = axes[1, col].bar(range(len(metric_values)), metric_values)
axes[1, col].set_xticks(range(len(metric_names)))
axes[1, col].set_xticklabels(metric_names, rotation=45, ha='right')
axes[1, col].set_title(f'{name} Metrics')
# Color bars based on "goodness" (green=good, red=bad)
for j, (bar, metric) in enumerate(zip(bars, metrics)):
if metric in ['silhouette', 'calinski_harabasz', 'adjusted_rand_score',
'normalized_mutual_info', 'homogeneity', 'completeness', 'v_measure']:
# Higher is better
bar.set_color('green' if metric_values[j] > 0.5 else 'orange')
elif metric == 'davies_bouldin':
# Lower is better
bar.set_color('green' if metric_values[j] < 2.0 else 'orange')
plt.tight_layout()
plt.show()
return evaluation_results
# Example comprehensive comparison
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
# Generate complex dataset
np.random.seed(42)
X_complex, true_labels = make_blobs(n_samples=300, centers=4,
cluster_std=1.0, random_state=42)
# Add some noise
noise = np.random.uniform(-6, 6, (30, 2))
X_complex = np.vstack([X_complex, noise])
true_labels = np.concatenate([true_labels, [-1] * 30]) # -1 for noise
# Apply different algorithms
algorithms = {
'K-Means': KMeans(n_clusters=4, random_state=42, n_init=10),
'Hierarchical': AgglomerativeClustering(n_clusters=4, linkage='ward'),
'DBSCAN': DBSCAN(eps=0.5, min_samples=10),
'GMM': GaussianMixture(n_components=4, random_state=42)
}
# Evaluate and visualize
evaluation_results = visualize_clustering_evaluation(X_complex, algorithms, true_labels)
# Print detailed results
print("\n=== Detailed Evaluation Results ===")
for name, results in evaluation_results.items():
print(f"\n{name}:")
for metric, value in results.items():
if isinstance(value, (int, float)):
print(f" {metric}: {value:.3f}")
else:
print(f" {metric}: {value}")| Algorithm | Best Use Cases | Advantages | Limitations |
|---|---|---|---|
| K-Means | Spherical clusters, known k | Fast, simple, scalable | Assumes spherical shapes, needs k |
| Hierarchical | Unknown k, small datasets | No k needed, deterministic | Slow O(n³), sensitive to outliers |
| DBSCAN | Non-spherical, noise handling | Handles noise, arbitrary shapes | Sensitive to parameters, struggles with varying densities |
| GMM | Soft clustering, overlapping clusters | Probabilistic, handles overlaps | Assumes Gaussian distributions, needs k |
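As a quick illustration of the varying-density limitation noted for DBSCAN in the table above, the following sketch shows that no single eps suits both a tight and a diffuse cluster (the synthetic blobs and parameter values are illustrative assumptions):
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
# One tight blob and one diffuse blob with very different densities
X_dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=42)
X_sparse, _ = make_blobs(n_samples=200, centers=[[6, 6]], cluster_std=1.5, random_state=42)
X_mixed = np.vstack([X_dense, X_sparse])
# A small eps fragments the diffuse blob into many noise points; a larger
# eps recovers it. No single value treats both densities equally well.
for eps in [0.3, 1.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_mixed)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
Algorithms such as OPTICS or HDBSCAN relax this single-eps assumption and handle variable densities more gracefully.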
def clustering_algorithm_recommender(X, requirements):
"""Recommend clustering algorithm based on data characteristics"""
n_samples, n_features = X.shape
recommendations = []
# Data size considerations
if n_samples > 10000:
recommendations.append("Consider K-Means or Mini-Batch K-Means for large datasets")
# Cluster shape considerations
if requirements.get('arbitrary_shapes', False):
recommendations.append("DBSCAN for arbitrary cluster shapes")
# Noise handling
if requirements.get('noise_present', False):
recommendations.append("DBSCAN for automatic noise detection")
# Probability estimates
if requirements.get('soft_clustering', False):
recommendations.append("Gaussian Mixture Models for soft clustering")
# Number of clusters
if requirements.get('unknown_k', False):
recommendations.append("Hierarchical clustering or DBSCAN (no k required)")
# Interpretability
if requirements.get('hierarchical_structure', False):
recommendations.append("Hierarchical clustering for tree-like structure")
# Speed requirements
if requirements.get('speed_critical', False):
recommendations.append("K-Means for fastest performance")
return recommendations
# Example usage
data_requirements = {
'arbitrary_shapes': True,
'noise_present': True,
'unknown_k': True,
'soft_clustering': False,
'hierarchical_structure': False,
'speed_critical': False
}
recommendations = clustering_algorithm_recommender(X_complex, data_requirements)
print("Algorithm Recommendations:")
for rec in recommendations:
print(f"- {rec}")An e-commerce company wants to segment customers based on their purchasing behavior to create targeted marketing campaigns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
# Create synthetic customer dataset
np.random.seed(42)
def generate_customer_data(n_customers=1000):
"""Generate realistic customer data for segmentation"""
# Define customer segments
segments = {
'High Value': {'size': 200, 'annual_spend': (5000, 15000),
'frequency': (50, 100), 'avg_order': (100, 300)},
'Medium Value': {'size': 500, 'annual_spend': (1000, 5000),
'frequency': (20, 50), 'avg_order': (50, 150)},
'Low Value': {'size': 250, 'annual_spend': (100, 1000),
'frequency': (5, 20), 'avg_order': (20, 80)},
'Churned': {'size': 50, 'annual_spend': (0, 200),
'frequency': (0, 5), 'avg_order': (10, 50)}
}
customer_data = []
for segment, params in segments.items():
for _ in range(params['size']):
customer = {
'customer_id': len(customer_data) + 1,
'annual_spending': np.random.uniform(*params['annual_spend']),
'purchase_frequency': np.random.uniform(*params['frequency']),
'avg_order_value': np.random.uniform(*params['avg_order']),
'months_since_last_purchase': np.random.exponential(2),
'true_segment': segment
}
customer_data.append(customer)
# Add derived features
for customer in customer_data:
customer['customer_lifetime_value'] = (
customer['annual_spending'] *
(1 + customer['purchase_frequency'] / 50)
)
customer['engagement_score'] = (
customer['purchase_frequency'] /
(1 + customer['months_since_last_purchase'])
)
return pd.DataFrame(customer_data)
# Generate and explore data
customer_df = generate_customer_data()
print("Customer Dataset Overview:")
print(customer_df.head())
print(f"\nDataset shape: {customer_df.shape}")
print("\nFeature statistics:")
print(customer_df.describe())
# Visualize feature distributions
features = ['annual_spending', 'purchase_frequency', 'avg_order_value',
'customer_lifetime_value', 'engagement_score']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
for i, feature in enumerate(features):
axes[i].hist(customer_df[feature], bins=30, alpha=0.7, edgecolor='black')
axes[i].set_title(f'{feature.replace("_", " ").title()}')
axes[i].grid(True, alpha=0.3)
# Remove empty subplot
axes[5].axis('off')
plt.tight_layout()
plt.show()
def preprocess_customer_data(df):
"""Preprocess customer data for clustering"""
# Select features for clustering
clustering_features = [
'annual_spending', 'purchase_frequency', 'avg_order_value',
'customer_lifetime_value', 'engagement_score'
]
X = df[clustering_features].copy()
# Handle any missing values
X = X.fillna(X.median())
# Log transform skewed features
skewed_features = ['annual_spending', 'customer_lifetime_value']
for feature in skewed_features:
X[f'{feature}_log'] = np.log1p(X[feature])
# Create RFM-like scores
X['recency_score'] = 1 / (1 + df['months_since_last_purchase'])
X['frequency_score'] = np.log1p(df['purchase_frequency'])
X['monetary_score'] = np.log1p(df['annual_spending'])
# Standardize features
scaler = StandardScaler()
feature_cols = X.columns
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=feature_cols, index=X.index)
return X_scaled_df, scaler, feature_cols
X_processed, scaler, feature_names = preprocess_customer_data(customer_df)
print("Processed features shape:", X_processed.shape)
print("Feature names:", list(feature_names))
# Correlation analysis
plt.figure(figsize=(12, 8))
correlation_matrix = X_processed.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
def customer_segmentation_analysis(X, customer_df):
"""Apply multiple clustering algorithms for customer segmentation"""
# Define algorithms to test
algorithms = {
'K-Means': KMeans(n_clusters=4, random_state=42, n_init=10),
'Hierarchical': AgglomerativeClustering(n_clusters=4, linkage='ward'),
'DBSCAN': DBSCAN(eps=0.5, min_samples=10),
'GMM': GaussianMixture(n_components=4, random_state=42)
}
results = {}
# Apply each algorithm
for name, algorithm in algorithms.items():
print(f"\nApplying {name}...")
if name == 'DBSCAN':
# For DBSCAN, we need to tune parameters
from sklearn.neighbors import NearestNeighbors
# Find optimal eps using k-distance plot
nbrs = NearestNeighbors(n_neighbors=10).fit(X)
distances, indices = nbrs.kneighbors(X)
distances = np.sort(distances[:, 9], axis=0)[::-1]
# Use elbow method to find eps
second_derivative = np.gradient(np.gradient(distances))
knee_point = np.argmax(second_derivative[:len(distances)//3]) # Look at first third
optimal_eps = distances[knee_point]
algorithm = DBSCAN(eps=optimal_eps, min_samples=10)
# Fit and predict
if hasattr(algorithm, 'fit_predict'):
labels = algorithm.fit_predict(X)
else:
labels = algorithm.fit(X).predict(X)
# Calculate metrics
if len(set(labels)) > 1 and -1 not in labels:
silhouette = silhouette_score(X, labels)
elif len(set(labels)) > 1:
# Handle DBSCAN with noise
mask = labels != -1
if np.sum(mask) > 1 and len(set(labels[mask])) > 1:
silhouette = silhouette_score(X[mask], labels[mask])
else:
silhouette = -1
else:
silhouette = -1
results[name] = {
'labels': labels,
'silhouette_score': silhouette,
'n_clusters': len(set(labels)) - (1 if -1 in labels else 0),
'n_noise': np.sum(labels == -1) if -1 in labels else 0
}
print(f" Clusters: {results[name]['n_clusters']}")
print(f" Noise points: {results[name]['n_noise']}")
print(f" Silhouette Score: {results[name]['silhouette_score']:.3f}")
return results
# Perform segmentation analysis
segmentation_results = customer_segmentation_analysis(X_processed.values, customer_df)
# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()
for i, (name, result) in enumerate(segmentation_results.items()):
labels = result['labels']
# Use first two principal components for visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_processed)
# Plot clusters
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
if k == -1:
# Noise points
class_member_mask = (labels == k)
xy = X_pca[class_member_mask]
axes[i].scatter(xy[:, 0], xy[:, 1], c='black', marker='x',
s=50, alpha=0.7, label='Noise')
else:
class_member_mask = (labels == k)
xy = X_pca[class_member_mask]
axes[i].scatter(xy[:, 0], xy[:, 1], c=[col], s=50, alpha=0.7,
label=f'Cluster {k}')
axes[i].set_title(f'{name}\nSilhouette: {result["silhouette_score"]:.3f}')
axes[i].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2f})')
axes[i].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2f})')
axes[i].legend()
axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
def analyze_customer_segments(customer_df, labels, algorithm_name):
"""Analyze and interpret customer segments"""
# Add cluster labels to dataframe
df_analysis = customer_df.copy()
df_analysis['predicted_cluster'] = labels
# Remove noise points for analysis
if -1 in labels:
df_clean = df_analysis[df_analysis['predicted_cluster'] != -1]
print(f"Removed {sum(labels == -1)} noise points for analysis")
else:
df_clean = df_analysis
print(f"\n=== {algorithm_name} Segment Analysis ===")
# Segment characteristics
segment_summary = df_clean.groupby('predicted_cluster').agg({
'annual_spending': ['mean', 'median', 'std'],
'purchase_frequency': ['mean', 'median'],
'avg_order_value': ['mean', 'median'],
'customer_lifetime_value': ['mean', 'median'],
'engagement_score': ['mean', 'median'],
'months_since_last_purchase': ['mean', 'median']
}).round(2)
print("\nSegment Summary Statistics:")
print(segment_summary)
# Segment sizes
segment_sizes = df_clean['predicted_cluster'].value_counts().sort_index()
print(f"\nSegment Sizes:")
for cluster, size in segment_sizes.items():
percentage = (size / len(df_clean)) * 100
print(f" Cluster {cluster}: {size} customers ({percentage:.1f}%)")
# Business positioning insights
print(f"\n=== Business Insights for {algorithm_name} ===")
for cluster in sorted(df_clean['predicted_cluster'].unique()):
cluster_data = df_clean[df_clean['predicted_cluster'] == cluster]
avg_spending = cluster_data['annual_spending'].mean()
avg_frequency = cluster_data['purchase_frequency'].mean()
avg_order = cluster_data['avg_order_value'].mean()
avg_clv = cluster_data['customer_lifetime_value'].mean()
print(f"\nCluster {cluster} Profile:")
print(f" Size: {len(cluster_data)} customers")
print(f" Avg Annual Spending: ${avg_spending:,.0f}")
print(f" Avg Purchase Frequency: {avg_frequency:.1f} times/year")
print(f" Avg Order Value: ${avg_order:.0f}")
print(f" Avg Customer Lifetime Value: ${avg_clv:,.0f}")
# Segment classification
if avg_spending > 5000 and avg_frequency > 40:
segment_type = "🌟 VIP Customers - High value, frequent buyers"
elif avg_spending > 2000 and avg_frequency > 20:
segment_type = "💎 Loyal Customers - Regular, valuable buyers"
elif avg_spending < 1000 and avg_frequency < 15:
segment_type = "📈 Growth Potential - Low engagement, needs attention"
else:
segment_type = "⚖️ Balanced Customers - Moderate engagement"
print(f" Segment Type: {segment_type}")
# Marketing recommendations
if "VIP" in segment_type:
print(" 📋 Marketing Strategy: Premium service, exclusive offers, loyalty rewards")
elif "Loyal" in segment_type:
print(" 📋 Marketing Strategy: Retention programs, cross-selling, referral incentives")
elif "Growth" in segment_type:
print(" 📋 Marketing Strategy: Re-engagement campaigns, special promotions, onboarding")
else:
print(" 📋 Marketing Strategy: Upselling, engagement programs, targeted offers")
return df_analysis
# Analyze the best performing algorithm (highest silhouette score)
best_algorithm = max(segmentation_results.keys(),
key=lambda k: segmentation_results[k]['silhouette_score'])
best_labels = segmentation_results[best_algorithm]['labels']
print(f"Best performing algorithm: {best_algorithm}")
customer_analysis = analyze_customer_segments(customer_df, best_labels, best_algorithm)
# Create business dashboard visualization
def create_segmentation_dashboard(df_analysis):
"""Create a business dashboard for customer segmentation"""
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Remove noise points for visualization
df_viz = df_analysis[df_analysis['predicted_cluster'] != -1].copy()
# 1. Segment sizes pie chart
segment_sizes = df_viz['predicted_cluster'].value_counts()
axes[0, 0].pie(segment_sizes.values, labels=[f'Segment {i}' for i in segment_sizes.index],
autopct='%1.1f%%', startangle=90)
axes[0, 0].set_title('Customer Segment Distribution')
# 2. Annual spending by segment
df_viz.boxplot(column='annual_spending', by='predicted_cluster', ax=axes[0, 1])
axes[0, 1].set_title('Annual Spending by Segment')
axes[0, 1].set_xlabel('Segment')
axes[0, 1].set_ylabel('Annual Spending ($)')
# 3. Purchase frequency by segment
df_viz.boxplot(column='purchase_frequency', by='predicted_cluster', ax=axes[0, 2])
axes[0, 2].set_title('Purchase Frequency by Segment')
axes[0, 2].set_xlabel('Segment')
axes[0, 2].set_ylabel('Purchases per Year')
# 4. Customer Lifetime Value by segment
segment_clv = df_viz.groupby('predicted_cluster')['customer_lifetime_value'].mean()
bars = axes[1, 0].bar(range(len(segment_clv)), segment_clv.values)
axes[1, 0].set_title('Average Customer Lifetime Value by Segment')
axes[1, 0].set_xlabel('Segment')
axes[1, 0].set_ylabel('CLV ($)')
axes[1, 0].set_xticks(range(len(segment_clv)))
axes[1, 0].set_xticklabels([f'Segment {i}' for i in segment_clv.index])
# Add value labels on bars
for bar, value in zip(bars, segment_clv.values):
axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + value*0.01,
f'${value:,.0f}', ha='center', va='bottom')
# 5. Engagement score distribution
df_viz.boxplot(column='engagement_score', by='predicted_cluster', ax=axes[1, 1])
axes[1, 1].set_title('Engagement Score by Segment')
axes[1, 1].set_xlabel('Segment')
axes[1, 1].set_ylabel('Engagement Score')
# 6. Revenue contribution
segment_revenue = df_viz.groupby('predicted_cluster')['annual_spending'].sum()
total_revenue = segment_revenue.sum()
revenue_pct = (segment_revenue / total_revenue * 100)
bars = axes[1, 2].bar(range(len(revenue_pct)), revenue_pct.values)
axes[1, 2].set_title('Revenue Contribution by Segment')
axes[1, 2].set_xlabel('Segment')
axes[1, 2].set_ylabel('Revenue Contribution (%)')
axes[1, 2].set_xticks(range(len(revenue_pct)))
axes[1, 2].set_xticklabels([f'Segment {i}' for i in revenue_pct.index])
# Add percentage labels
for bar, value in zip(bars, revenue_pct.values):
axes[1, 2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + value*0.01,
f'{value:.1f}%', ha='center', va='bottom')
plt.tight_layout()
plt.show()
create_segmentation_dashboard(customer_analysis)
Analyze product features and customer preferences to identify market segments and positioning opportunities.
def market_research_case_study():
"""Market research clustering case study"""
# Generate product features dataset
np.random.seed(42)
products = []
categories = ['Electronics', 'Fashion', 'Home', 'Sports', 'Books']
for i in range(500):
category = np.random.choice(categories)
# Category-specific feature generation
if category == 'Electronics':
price = np.random.lognormal(6, 1) # Higher prices
quality_rating = np.random.normal(4.2, 0.5)
innovation_score = np.random.normal(7.5, 1.5)
elif category == 'Fashion':
price = np.random.lognormal(4, 1.2)
quality_rating = np.random.normal(3.8, 0.7)
innovation_score = np.random.normal(6.0, 2.0)
elif category == 'Home':
price = np.random.lognormal(5, 1.5)
quality_rating = np.random.normal(4.0, 0.6)
innovation_score = np.random.normal(5.5, 1.8)
elif category == 'Sports':
price = np.random.lognormal(4.5, 1.3)
quality_rating = np.random.normal(4.1, 0.5)
innovation_score = np.random.normal(6.5, 1.2)
else: # Books
price = np.random.lognormal(2.5, 0.8)
quality_rating = np.random.normal(4.3, 0.4)
innovation_score = np.random.normal(4.0, 1.0)
# Ensure realistic ranges
quality_rating = np.clip(quality_rating, 1, 5)
innovation_score = np.clip(innovation_score, 1, 10)
product = {
'product_id': i + 1,
'category': category,
'price': price,
'quality_rating': quality_rating,
'innovation_score': innovation_score,
'brand_strength': np.random.normal(5, 2),
'market_share': np.random.exponential(2),
'customer_satisfaction': np.random.normal(3.8, 0.8)
}
# Ensure reasonable ranges
product['brand_strength'] = np.clip(product['brand_strength'], 1, 10)
product['market_share'] = np.clip(product['market_share'], 0.1, 15)
product['customer_satisfaction'] = np.clip(product['customer_satisfaction'], 1, 5)
products.append(product)
products_df = pd.DataFrame(products)
print("Market Research Dataset:")
print(products_df.head())
print(f"\nDataset shape: {products_df.shape}")
print(f"\nCategories: {products_df['category'].unique()}")
# Preprocessing for clustering
feature_cols = ['price', 'quality_rating', 'innovation_score',
'brand_strength', 'market_share', 'customer_satisfaction']
X_market = products_df[feature_cols].copy()
# Log transform skewed features
X_market['price_log'] = np.log1p(X_market['price'])
X_market['market_share_log'] = np.log1p(X_market['market_share'])
# Standardize features
scaler = StandardScaler()
X_market_scaled = scaler.fit_transform(X_market)
# Apply clustering
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(X_market_scaled)
products_df['market_segment'] = cluster_labels
# Analyze segments
print(f"\n=== Market Segment Analysis ===")
segment_analysis = products_df.groupby('market_segment').agg({
'price': ['mean', 'median'],
'quality_rating': ['mean', 'std'],
'innovation_score': ['mean', 'std'],
'brand_strength': ['mean', 'std'],
'market_share': ['mean', 'sum'],
'customer_satisfaction': 'mean'
}).round(2)
print("\nSegment Characteristics:")
print(segment_analysis)
# Business positioning insights
for segment in range(n_clusters):
segment_data = products_df[products_df['market_segment'] == segment]
avg_price = segment_data['price'].mean()
avg_quality = segment_data['quality_rating'].mean()
avg_innovation = segment_data['innovation_score'].mean()
print(f"\nSegment {segment} - Market Position:")
print(f" Products: {len(segment_data)}")
print(f" Avg Price: ${avg_price:.2f}")
print(f" Quality Rating: {avg_quality:.2f}/5")
print(f" Innovation Score: {avg_innovation:.2f}/10")
# Position classification
if avg_price > products_df['price'].median() and avg_quality > 4.0:
position = "Premium Segment - High price, high quality"
elif avg_price < products_df['price'].median() and avg_quality < 3.5:
position = "Budget Segment - Low price, basic quality"
elif avg_innovation > 7.0:
position = "Innovation Leaders - High tech, early adopters"
else:
position = "Mainstream Segment - Balanced offerings"
print(f" Market Position: {position}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Price vs Quality positioning map
scatter = axes[0, 0].scatter(products_df['price'], products_df['quality_rating'],
c=products_df['market_segment'], cmap='viridis', alpha=0.7)
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Quality Rating')
axes[0, 0].set_title('Price vs Quality Market Map')
plt.colorbar(scatter, ax=axes[0, 0])
# Innovation vs Brand Strength
scatter = axes[0, 1].scatter(products_df['innovation_score'], products_df['brand_strength'],
c=products_df['market_segment'], cmap='viridis', alpha=0.7)
axes[0, 1].set_xlabel('Innovation Score')
axes[0, 1].set_ylabel('Brand Strength')
axes[0, 1].set_title('Innovation vs Brand Strength')
plt.colorbar(scatter, ax=axes[0, 1])
# Market share distribution by segment
products_df.boxplot(column='market_share', by='market_segment', ax=axes[1, 0])
axes[1, 0].set_title('Market Share by Segment')
# Customer satisfaction by segment
products_df.boxplot(column='customer_satisfaction', by='market_segment', ax=axes[1, 1])
axes[1, 1].set_title('Customer Satisfaction by Segment')
plt.tight_layout()
plt.show()
return products_df
products_analysis = market_research_case_study()
Exercise 6.1 (Difficulty: Medium)
Implement a simple version of K-Means++ initialization and compare its performance with random initialization.
# Exercise 6.1 Solution Template
def kmeans_plus_plus_init(X, k):
"""
Implement K-Means++ initialization
Parameters:
X: data points
k: number of clusters
Returns:
centroids: initial centroids using K-Means++
"""
# TODO: Implement K-Means++ initialization
# 1. Choose first centroid randomly
# 2. For each subsequent centroid:
# - Calculate distance to nearest existing centroid for each point
# - Choose next centroid with probability proportional to squared distance
pass
# Test your implementation
def test_initialization_methods(X, k=3, n_runs=10):
"""Compare random vs K-Means++ initialization"""
# TODO: Compare performance of both methods
# Measure: final WCSS, number of iterations to converge
pass
# Example usage:
# X_test, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# test_initialization_methods(X_test)
Exercise 6.2 (Difficulty: Medium)
Given a dataset, create dendrograms for different linkage methods and analyze which method works best.
# Exercise 6.2 Solution Template
def analyze_linkage_methods(X, methods=['single', 'complete', 'average', 'ward']):
"""
Analyze different hierarchical clustering linkage methods
TODO:
1. Create dendrograms for each method
2. Calculate silhouette scores for different numbers of clusters
3. Recommend best method and optimal number of clusters
"""
pass
# Test with different datasets:
# - Compact clusters (blobs)
# - Elongated clusters (moons)
# - Nested clusters (circles)
Exercise 6.3 (Difficulty: Hard)
Create an automated parameter tuning system for DBSCAN using multiple evaluation metrics.
# Exercise 6.3 Solution Template
def automated_dbscan_tuning(X, eps_range=None, min_samples_range=None):
"""
Automatically tune DBSCAN parameters
TODO:
1. Use k-distance plot to suggest eps range
2. Grid search over parameter combinations
3. Use multiple metrics: silhouette, noise ratio, cluster stability
4. Return best parameters and reasoning
"""
pass
def dbscan_stability_analysis(X, eps, min_samples, n_runs=10):
"""
Analyze DBSCAN stability across multiple runs with data subsampling
"""
pass
Exercise 6.4 (Difficulty: Hard)
Complete end-to-end customer segmentation project with business recommendations.
Requirements:
1. Load and explore customer transaction data
2. Engineer meaningful features (RFM analysis, behavioral patterns)
3. Apply multiple clustering algorithms
4. Evaluate and select best approach
5. Create business insights and actionable recommendations
6. Build visualization dashboard
# Exercise 6.4 Project Template
class CustomerSegmentationProject:
def __init__(self):
self.data = None
self.processed_data = None
self.models = {}
self.results = {}
def load_data(self, file_path):
"""Load customer transaction data"""
pass
def feature_engineering(self):
"""Create RFM and behavioral features"""
pass
def apply_clustering_algorithms(self):
"""Apply multiple clustering methods"""
pass
def evaluate_models(self):
"""Compare clustering results"""
pass
def generate_business_insights(self):
"""Create actionable business recommendations"""
pass
def create_dashboard(self):
"""Build interactive visualization dashboard"""
pass
# Usage:
# project = CustomerSegmentationProject()
# project.load_data('customer_data.csv')
# project.feature_engineering()
# project.apply_clustering_algorithms()
# project.evaluate_models()
# project.generate_business_insights()
# project.create_dashboard()
✅ Clustering Fundamentals
- Understanding unsupervised learning principles
- Types of clustering problems and applications
- Distance metrics and similarity measures
- Evaluation metrics for clustering
✅ K-Means Clustering
- Algorithm implementation and optimization
- Parameter selection (elbow method, silhouette analysis)
- Variants: K-Means++, Mini-Batch K-Means
- Advantages, limitations, and best practices
✅ Hierarchical Clustering
- Agglomerative and divisive approaches
- Linkage criteria and dendrogram interpretation
- Connectivity-constrained clustering
- When to choose hierarchical over partitional methods
✅ Advanced Clustering Techniques
- DBSCAN for density-based clustering
- Gaussian Mixture Models for soft clustering
- Parameter tuning strategies
- Algorithm selection guidelines
✅ Practical Applications
- Customer segmentation analysis
- Market research and positioning
- Real-world case studies and business insights
- Dashboard creation and presentation
🏢 Business Intelligence
- Customer segmentation and lifetime value analysis
- Market research and competitive positioning
- Fraud detection and anomaly identification
🔬 Data Science
- Exploratory data analysis and pattern discovery
- Dimensionality reduction preprocessing
- Feature engineering and selection
🎯 Marketing Analytics
- Targeted campaign development
- Product recommendation systems
- Behavioral analysis and personalization
💻 Implementation Skills
- From-scratch algorithm implementation
- Scikit-learn library proficiency
- Parameter tuning and optimization
- Performance evaluation and comparison
📊 Visualization Skills
- Cluster visualization techniques
- Dendrogram interpretation
- Business dashboard creation
- Statistical plot generation
🧠 Analytical Skills
- Algorithm selection criteria
- Business insight generation
- Statistical interpretation
- Problem-solving methodology
The clustering techniques learned in this chapter provide the foundation for:
- Chapter 7: Dimensionality Reduction (PCA, t-SNE)
- Advanced ML: Ensemble methods and model combinations
- Deep Learning: Unsupervised neural networks and autoencoders
- Big Data: Distributed clustering algorithms
Algorithm Comparison: Compare K-Means, Hierarchical, and DBSCAN clustering algorithms in terms of computational complexity, scalability, and cluster shape assumptions.
Parameter Selection: Explain the trade-offs in DBSCAN parameter selection and how the choice of eps and min_samples affects clustering results.
Evaluation Metrics: Discuss the differences between internal and external clustering evaluation metrics. When would you use each type?
Custom Distance Metrics: Implement K-Means clustering with Manhattan distance instead of Euclidean distance.
Streaming Clustering: Design a system for clustering data streams where new data points arrive continuously.
Multi-Objective Clustering: Develop a clustering approach that optimizes for both cluster cohesion and business constraints.
Image Segmentation: Apply clustering techniques to segment images for computer vision applications.
Social Network Analysis: Use clustering to identify communities in social network data.
Gene Expression Analysis: Apply clustering to identify co-expressed genes in biological datasets.
End of Chapter 6: Clustering Algorithms
CO5 - Apply unsupervised learning models
By the end of this chapter, students will be able to:
- Understand the curse of dimensionality and its impact on machine learning
- Apply Principal Component Analysis (PCA) for dimensionality reduction
- Implement t-SNE for high-dimensional data visualization
- Use Linear Discriminant Analysis (LDA) for supervised dimensionality reduction
- Select appropriate dimensionality reduction techniques for different scenarios
- Integrate dimensionality reduction with clustering and classification pipelines
“The most beautiful thing we can experience is the mysterious. It is the source of all true art and science.” – Albert Einstein
Imagine standing in a vast, invisible cathedral where each pillar represents a dimension of your data. In this sacred space of machine learning, we often find ourselves overwhelmed by thousands, sometimes millions of these pillars, each feature a voice in a complex symphony of information. Yet, like a master conductor who can hear the essential melody beneath the orchestral complexity, dimensionality reduction allows us to distill this cacophony into pure, meaningful harmony.
This chapter is your journey into the profound art of seeing patterns in the unseen. We’ll explore how mathematical elegance meets computational necessity, where ancient geometric principles guide modern algorithms, and where the reduction of complexity reveals hidden beauty in data.
What awaits you in this chapter:
- The Philosophical Foundation: Understanding why “more” isn’t always “better” in the mathematical universe
- PCA as Mathematical Archeology: Uncovering the principal stories hidden in your data’s covariance structure
- t-SNE as Digital Artistry: Painting high-dimensional landscapes on two-dimensional canvases
- LDA as Supervised Wisdom: Learning to see differences that matter most
- The Future Landscape: Emerging techniques that push the boundaries of dimensional understanding
In this chapter, we don’t just learn algorithms; we develop the artistic intuition of a data scientist who sees beyond dimensions.
“In higher dimensions, intuition goes to die, but mathematics comes alive.” – Anonymous Data Scientist
Picture this: You’re an explorer in a mathematical universe where each step forward adds another dimension to your world. At first, moving from 1D to 2D to 3D feels natural; we can visualize, touch, and understand these spaces. But as you venture into the 10th dimension, then the 100th, then the 1000th, something magical and terrifying happens: the very fabric of space begins to betray your intuition.
This is the curse of dimensionality: not merely a technical challenge, but a profound philosophical statement about the nature of space, distance, and meaning in mathematics. It’s a phenomenon so counterintuitive that it forced mathematicians to rebuild their understanding of geometry itself.
In our three-dimensional world, if you double the radius of a sphere, its volume increases by a factor of 8 (2³). Intuitive, right? But in higher dimensions, something almost mystical occurs: most of a hypersphere’s volume concentrates in a thin shell near its surface. The interior becomes increasingly empty as dimensions grow.
This isn’t just mathematical curiosity; it’s the reason why your machine learning algorithms sometimes seem to lose their way in high-dimensional space, why distances become meaningless, and why the very concept of “similarity” requires redefinition.
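A short numeric check makes the shell effect concrete: the volume of a d-dimensional ball scales as r^d, so the fraction of a unit ball's volume lying outside radius 0.9 is 1 - 0.9^d (the 0.9 cutoff is an arbitrary choice for illustration).
# Fraction of a d-dimensional unit ball's volume in the thin outer
# shell between radius 0.9 and 1.0: volume scales as r**d, so the
# shell holds 1 - 0.9**d of the total, approaching 100% as d grows.
for d in [2, 3, 10, 50, 100]:
    print(f"d={d:>3}: {1 - 0.9 ** d:.1%} of the volume lies in the outer 10% shell")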
1. Exponential Growth of Space
import numpy as np
import matplotlib.pyplot as plt
def demonstrate_volume_growth():
"""Demonstrate how volume grows with dimensions"""
dimensions = range(1, 21)
volumes = []
# Calculate volume of unit hypersphere in d dimensions
for d in dimensions:
if d == 1:
volume = 2 # Line segment [-1, 1]
elif d == 2:
volume = np.pi # Circle with radius 1
else:
# Hypersphere volume formula
from math import gamma
volume = (np.pi**(d/2)) / gamma(d/2 + 1)
volumes.append(volume)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(dimensions, volumes, 'bo-')
plt.xlabel('Number of Dimensions')
plt.ylabel('Unit Hypersphere Volume')
plt.title('Volume Growth in High Dimensions')
plt.grid(True)
# Demonstrate distance distribution
np.random.seed(42)
dimensions_to_test = [2, 10, 50, 100]
plt.subplot(1, 2, 2)
for d in dimensions_to_test:
# Generate random points and calculate pairwise distances
points = np.random.normal(0, 1, (1000, d))
distances = []
for i in range(100): # Sample pairs
dist = np.linalg.norm(points[i] - points[i+1])
distances.append(dist)
plt.hist(distances, bins=20, alpha=0.6, label=f'D={d}', density=True)
plt.xlabel('Distance')
plt.ylabel('Density')
plt.title('Distance Distribution in Different Dimensions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
demonstrate_volume_growth()
2. Distance Concentration
In high dimensions, all points become approximately equidistant from each other.
def analyze_distance_concentration():
"""Analyze how distances concentrate in high dimensions"""
np.random.seed(42)
dimensions = [2, 5, 10, 20, 50, 100]
results = []
for d in dimensions:
# Generate random points
n_points = 1000
points = np.random.normal(0, 1, (n_points, d))
# Calculate all pairwise distances
distances = []
for i in range(min(100, n_points-1)): # Sample for efficiency
for j in range(i+1, min(i+11, n_points)):
dist = np.linalg.norm(points[i] - points[j])
distances.append(dist)
distances = np.array(distances)
# Calculate concentration metrics
mean_dist = np.mean(distances)
std_dist = np.std(distances)
coefficient_variation = std_dist / mean_dist
results.append({
'dimension': d,
'mean_distance': mean_dist,
'std_distance': std_dist,
'coefficient_variation': coefficient_variation
})
# Plot results
import pandas as pd
df = pd.DataFrame(results)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].plot(df['dimension'], df['mean_distance'], 'bo-')
axes[0].set_xlabel('Dimensions')
axes[0].set_ylabel('Mean Distance')
axes[0].set_title('Mean Distance vs Dimensions')
axes[0].grid(True)
axes[1].plot(df['dimension'], df['std_distance'], 'ro-')
axes[1].set_xlabel('Dimensions')
axes[1].set_ylabel('Standard Deviation')
axes[1].set_title('Distance Variation vs Dimensions')
axes[1].grid(True)
axes[2].plot(df['dimension'], df['coefficient_variation'], 'go-')
axes[2].set_xlabel('Dimensions')
axes[2].set_ylabel('Coefficient of Variation')
axes[2].set_title('Distance Concentration (Lower = More Concentrated)')
axes[2].grid(True)
plt.tight_layout()
plt.show()
print("Distance Concentration Analysis:")
print(df.round(4))
return df
concentration_results = analyze_distance_concentration()
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
def demonstrate_knn_degradation():
"""Show how KNN performance degrades with increasing dimensions"""
np.random.seed(42)
n_samples = 1000
dimensions_to_test = [2, 5, 10, 20, 50, 100, 200]
results = []
for n_features in dimensions_to_test:
print(f"Testing {n_features} dimensions...")
# Generate classification dataset
X, y = make_classification(n_samples=n_samples,
n_features=n_features,
n_informative=min(n_features, 10),
n_redundant=0,
n_clusters_per_class=1,
random_state=42)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Test KNN performance
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy')
results.append({
'dimensions': n_features,
'mean_accuracy': scores.mean(),
'std_accuracy': scores.std()
})
# Plot results
df_results = pd.DataFrame(results)
plt.figure(figsize=(10, 6))
plt.errorbar(df_results['dimensions'], df_results['mean_accuracy'],
yerr=df_results['std_accuracy'], marker='o', capsize=5)
plt.xlabel('Number of Dimensions')
plt.ylabel('KNN Accuracy')
plt.title('KNN Performance Degradation with Increasing Dimensions')
plt.grid(True)
plt.show()
print("\nKNN Performance vs Dimensions:")
print(df_results.round(4))
return df_results
knn_results = demonstrate_knn_degradation()
import time
from sklearn.cluster import KMeans
def analyze_computational_complexity():
"""Analyze computational complexity with increasing dimensions"""
np.random.seed(42)
dimensions = [5, 10, 20, 50, 100, 200]
n_samples = 1000
computation_times = []
memory_usage = []
for d in dimensions:
print(f"Processing {d} dimensions...")
# Generate data
X = np.random.normal(0, 1, (n_samples, d))
# Measure KMeans computation time
start_time = time.time()
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X)
computation_time = time.time() - start_time
# Estimate memory usage (rough approximation)
memory_mb = X.nbytes / (1024 * 1024)
computation_times.append(computation_time)
memory_usage.append(memory_mb)
# Plot results
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(dimensions, computation_times, 'bo-')
axes[0].set_xlabel('Dimensions')
axes[0].set_ylabel('Computation Time (seconds)')
axes[0].set_title('KMeans Computation Time vs Dimensions')
axes[0].grid(True)
axes[1].plot(dimensions, memory_usage, 'ro-')
axes[1].set_xlabel('Dimensions')
axes[1].set_ylabel('Memory Usage (MB)')
axes[1].set_title('Memory Usage vs Dimensions')
axes[1].grid(True)
plt.tight_layout()
plt.show()
# Create summary
summary_df = pd.DataFrame({
'dimensions': dimensions,
'computation_time': computation_times,
'memory_mb': memory_usage
})
print("\nComputational Complexity Analysis:")
print(summary_df.round(4))
return summary_df
complexity_results = analyze_computational_complexity()
1. High-Dimensional Data Symptoms
def diagnose_high_dimensional_data(X, feature_names=None):
"""Diagnose if dataset suffers from high-dimensional problems"""
n_samples, n_features = X.shape
print(f"=== High-Dimensional Data Diagnosis ===")
print(f"Dataset shape: {X.shape}")
print(f"Samples to features ratio: {n_samples/n_features:.2f}")
diagnoses = []
recommendations = []
# 1. Check samples-to-features ratio
if n_samples < n_features:
diagnoses.append("⚠️ More features than samples (n < p problem)")
recommendations.append("Apply dimensionality reduction or feature selection")
elif n_samples < 10 * n_features:
diagnoses.append("⚠️ Low samples-to-features ratio")
recommendations.append("Consider dimensionality reduction for better generalization")
# 2. Check for high sparsity
sparsity = np.mean(X == 0)
if sparsity > 0.8:
diagnoses.append(f"⚠️ High sparsity ({sparsity:.1%} zeros)")
recommendations.append("Apply sparse-aware dimensionality reduction")
# 3. Check correlation structure
correlation_matrix = np.corrcoef(X.T)
high_correlations = np.sum(np.abs(correlation_matrix) > 0.8) - n_features # Exclude diagonal
if high_correlations > n_features:
diagnoses.append(f"⚠️ Many highly correlated features ({high_correlations} pairs)")
recommendations.append("PCA can remove redundant information")
# 4. Check memory usage
memory_mb = X.nbytes / (1024 * 1024)
if memory_mb > 1000: # > 1GB
diagnoses.append(f"⚠️ Large memory footprint ({memory_mb:.1f} MB)")
recommendations.append("Dimensionality reduction can reduce memory usage")
# 5. Estimate computation time for common algorithms
if n_features > 100:
diagnoses.append("⚠️ High computational complexity expected")
recommendations.append("Reduce dimensions before applying ML algorithms")
print(f"\nDiagnoses:")
for diagnosis in diagnoses:
print(f" {diagnosis}")
print(f"\nRecommendations:")
for recommendation in recommendations:
print(f" ✅ {recommendation}")
# Calculate some useful statistics
stats = {
'n_samples': n_samples,
'n_features': n_features,
'ratio': n_samples / n_features,
'sparsity': sparsity,
'memory_mb': memory_mb,
'high_correlations': high_correlations
}
return stats, diagnoses, recommendations
# Example usage with different datasets
datasets = [
('Low-dimensional', np.random.normal(0, 1, (1000, 10))),
('Balanced', np.random.normal(0, 1, (1000, 50))),
('High-dimensional', np.random.normal(0, 1, (100, 500))),
('Very high-dimensional', np.random.normal(0, 1, (50, 2000)))
]
for name, X in datasets:
print(f"\n{'='*50}")
print(f"Dataset: {name}")
stats, diagnoses, recommendations = diagnose_high_dimensional_data(X)
Benefits of Dimensionality Reduction:
✅ Computational Efficiency: Faster training and prediction
✅ Memory Reduction: Lower storage requirements
✅ Visualization: Enables 2D/3D plotting of high-dimensional data
✅ Noise Reduction: Removes irrelevant features and noise
✅ Overfitting Prevention: Reduces model complexity
✅ Feature Engineering: Creates meaningful composite features
Trade-offs to Consider:
⚠️ Information Loss: Some data variance is discarded
⚠️ Interpretability: Transformed features may be harder to interpret
⚠️ Additional Preprocessing: Extra computational step required
⚠️ Parameter Tuning: Need to select the number of components/dimensions (see the sketch below)
⚠️ Algorithm Selection: Different methods suit different data types
def analyze_dimensionality_tradeoffs(X, y=None, max_components=None):
"""Analyze trade-offs of different dimensionality reduction levels"""
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import time
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
n_samples, n_features = X_scaled.shape
if max_components is None:
max_components = min(n_features, n_samples) - 1
# Test different numbers of components
n_components_list = [2, 5, 10, 20, 50, min(100, max_components), max_components]
n_components_list = sorted(set(n for n in n_components_list if n <= max_components))
results = []
for n_comp in n_components_list:
print(f"Testing {n_comp} components...")
# Apply PCA
pca = PCA(n_components=n_comp, random_state=42)
start_time = time.time()
X_reduced = pca.fit_transform(X_scaled)
transform_time = time.time() - start_time
# Calculate information retention
variance_explained = np.sum(pca.explained_variance_ratio_)
# Calculate compression ratio
original_size = X_scaled.nbytes
reduced_size = X_reduced.nbytes
compression_ratio = original_size / reduced_size
# If labels provided, test classification performance
classification_score = None
if y is not None:
try:
clf = LogisticRegression(random_state=42, max_iter=1000)
scores = cross_val_score(clf, X_reduced, y, cv=3)
classification_score = scores.mean()
except Exception:
classification_score = None
results.append({
'n_components': n_comp,
'variance_explained': variance_explained,
'compression_ratio': compression_ratio,
'transform_time': transform_time,
'classification_score': classification_score,
'memory_reduction': 1 - (reduced_size / original_size)
})
# Create visualization
df_results = pd.DataFrame(results)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Variance explained
axes[0, 0].plot(df_results['n_components'], df_results['variance_explained'], 'bo-')
axes[0, 0].set_xlabel('Number of Components')
axes[0, 0].set_ylabel('Variance Explained')
axes[0, 0].set_title('Information Retention')
axes[0, 0].grid(True)
axes[0, 0].axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
axes[0, 0].legend()
# Compression ratio
axes[0, 1].plot(df_results['n_components'], df_results['compression_ratio'], 'ro-')
axes[0, 1].set_xlabel('Number of Components')
axes[0, 1].set_ylabel('Compression Ratio')
axes[0, 1].set_title('Memory Compression')
axes[0, 1].grid(True)
# Transform time
axes[1, 0].plot(df_results['n_components'], df_results['transform_time'], 'go-')
axes[1, 0].set_xlabel('Number of Components')
axes[1, 0].set_ylabel('Transform Time (seconds)')
axes[1, 0].set_title('Computational Efficiency')
axes[1, 0].grid(True)
# Classification performance (if available)
if any(df_results['classification_score'].notna()):
valid_results = df_results.dropna(subset=['classification_score'])
axes[1, 1].plot(valid_results['n_components'], valid_results['classification_score'], 'mo-')
axes[1, 1].set_xlabel('Number of Components')
axes[1, 1].set_ylabel('Classification Accuracy')
axes[1, 1].set_title('Predictive Performance')
axes[1, 1].grid(True)
else:
axes[1, 1].text(0.5, 0.5, 'No classification\ntarget provided',
ha='center', va='center', transform=axes[1, 1].transAxes)
axes[1, 1].set_title('Classification Performance')
plt.tight_layout()
plt.show()
print("\nDimensionality Reduction Trade-off Analysis:")
print(df_results.round(4))
# Find optimal number of components
if any(df_results['classification_score'].notna()):
# If classification available, optimize for 95% variance + good performance
candidates = df_results[df_results['variance_explained'] >= 0.95]
if not candidates.empty:
optimal = candidates.loc[candidates['classification_score'].idxmax()]
print(f"\nRecommended components: {optimal['n_components']}")
print(f" Variance explained: {optimal['variance_explained']:.1%}")
print(f" Classification score: {optimal['classification_score']:.3f}")
print(f" Compression ratio: {optimal['compression_ratio']:.1f}x")
else:
# Optimize for elbow in variance explained curve
# Find point where marginal gain drops significantly
variance_gains = np.diff(df_results['variance_explained'])
elbow_idx = np.argmax(variance_gains < 0.01) if any(variance_gains < 0.01) else len(variance_gains)
optimal_components = df_results.iloc[elbow_idx]['n_components']
print(f"\nRecommended components: {optimal_components}")
print(f" Variance explained: {df_results.iloc[elbow_idx]['variance_explained']:.1%}")
print(f" Compression ratio: {df_results.iloc[elbow_idx]['compression_ratio']:.1f}x")
return df_results
# Example usage
X_example, y_example = make_classification(n_samples=1000, n_features=100,
n_informative=20, random_state=42)
tradeoff_analysis = analyze_dimensionality_tradeoffs(X_example, y_example)
“In the theater of high-dimensional space, PCA is both the choreographer and the audience - it knows exactly where to look to see the most beautiful movements.”
Imagine you’re a photographer trying to capture the essence of a complex, swirling dance performance. From your position, you see bodies moving in seemingly chaotic patterns, but you know that somewhere in this three-dimensional choreography lies a simpler, more beautiful story. PCA is your magical lens - it reveals the fundamental movements, the core rhythms that define the dance.
Principal Component Analysis isn’t just a dimensionality reduction technique - it’s mathematical poetry in motion. It whispers to us the deepest secret of high-dimensional data: that beneath apparent complexity often lies elegant simplicity, waiting to be discovered by those who know how to look.
When PCA seeks directions of maximum variance, it’s not just performing a mathematical optimization - it’s asking the data to reveal its most important stories. Variance is the language of difference, the vocabulary of variation. Where there is high variance, there are patterns, relationships, and insights waiting to be unlocked.
Think of it this way: If all your data points were identical, they would tell you nothing. It’s precisely in their differences (their variance) that information lives. PCA is the master detective who can spot these differences and organize them in order of importance.
1. Variance Maximization
PCA seeks the directions in which the data varies the most. The first principal component captures the maximum variance, the second captures the maximum remaining variance (orthogonal to the first), and so on, as the sketch below verifies numerically.
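A minimal sketch of both properties on synthetic data: successive components capture non-increasing variance, and the component vectors form an orthonormal (mutually orthogonal) basis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic 3D data with correlated features, for illustration only
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.5]], 500)

pca = PCA().fit(X)
print("Variance per component:", pca.explained_variance_.round(3))
# Eigenvalues come out sorted in decreasing order
print("Non-increasing:", bool(np.all(np.diff(pca.explained_variance_) <= 0)))
# Components form an orthonormal basis: V @ V.T = I
print("Orthonormal:", np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))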
2. Covariance Matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
def understand_covariance_matrix():
"""Understand covariance matrix and its role in PCA"""
# Generate 2D correlated data
np.random.seed(42)
mean = [2, 3]
cov = [[2, 1.5], [1.5, 1]] # Covariance matrix
data = np.random.multivariate_normal(mean, cov, 300)
# Center the data
data_centered = data - np.mean(data, axis=0)
# Calculate covariance matrix
cov_matrix = np.cov(data_centered.T)
print("Original Covariance Matrix:")
print(cov_matrix)
# Calculate eigenvalues and eigenvectors
eigenvals, eigenvecs = np.linalg.eig(cov_matrix)
# Sort by eigenvalues (descending)
idx = eigenvals.argsort()[::-1]
eigenvals = eigenvals[idx]
eigenvecs = eigenvecs[:, idx]
print(f"\nEigenvalues: {eigenvals}")
print(f"Eigenvectors:\n{eigenvecs}")
# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(data[:, 0], data[:, 1], alpha=0.6, label='Data')
plt.scatter(mean[0], mean[1], c='red', s=100, marker='x', label='Mean')
# Draw eigenvectors from mean
for i, (val, vec) in enumerate(zip(eigenvals, eigenvecs.T)):
plt.arrow(mean[0], mean[1], vec[0]*np.sqrt(val)*2, vec[1]*np.sqrt(val)*2,
head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
label=f'PC{i+1} (λ={val:.2f})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Data with Principal Components')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
# Show centered data
plt.subplot(1, 2, 2)
plt.scatter(data_centered[:, 0], data_centered[:, 1], alpha=0.6, label='Centered Data')
# Draw eigenvectors from origin
for i, (val, vec) in enumerate(zip(eigenvals, eigenvecs.T)):
plt.arrow(0, 0, vec[0]*np.sqrt(val)*2, vec[1]*np.sqrt(val)*2,
head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
label=f'PC{i+1} (λ={val:.2f})')
plt.xlabel('Feature 1 (centered)')
plt.ylabel('Feature 2 (centered)')
plt.title('Centered Data with Principal Components')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()
return data, cov_matrix, eigenvals, eigenvecs
data, cov_matrix, eigenvals, eigenvecs = understand_covariance_matrix()
The principal components are the eigenvectors of the covariance matrix, and the eigenvalues represent the variance along each component.
def step_by_step_pca_math():
"""Step-by-step mathematical derivation of PCA"""
# Generate sample data
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], [[3, 2], [2, 2]], 100)
print("Step-by-Step PCA Mathematical Process:")
print("="*50)
# Step 1: Center the data
print("Step 1: Center the data")
X_mean = np.mean(X, axis=0)
X_centered = X - X_mean
print(f"Original mean: {X_mean}")
print(f"Centered mean: {np.mean(X_centered, axis=0)}")
# Step 2: Compute covariance matrix
print("\nStep 2: Compute covariance matrix")
n_samples = X_centered.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
print(f"Covariance matrix:\n{cov_matrix}")
# Step 3: Eigendecomposition
print("\nStep 3: Eigendecomposition")
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort by eigenvalues (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Step 4: Calculate explained variance
print("\nStep 4: Calculate explained variance")
total_variance = np.sum(eigenvalues)
explained_variance_ratio = eigenvalues / total_variance
print(f"Explained variance ratio: {explained_variance_ratio}")
print(f"Cumulative explained variance: {np.cumsum(explained_variance_ratio)}")
# Step 5: Transform data
print("\nStep 5: Transform data to PC space")
X_pca = X_centered @ eigenvectors
print(f"Original data shape: {X.shape}")
print(f"Transformed data shape: {X_pca.shape}")
print(f"Variance in PC space: {np.var(X_pca, axis=0)}")
# Verify: variance in PC space should equal eigenvalues
print(f"Eigenvalues: {eigenvalues}")
print(f"Verification: variances match eigenvalues: {np.allclose(np.var(X_pca, axis=0, ddof=1), eigenvalues)}")
# Step 6: Reconstruction
print("\nStep 6: Data reconstruction")
X_reconstructed = X_pca @ eigenvectors.T + X_mean
reconstruction_error = np.mean((X - X_reconstructed)**2)
print(f"Reconstruction error (full components): {reconstruction_error:.10f}")
# Partial reconstruction (using only first component)
X_pca_1d = X_pca[:, :1] # Only first component
X_reconstructed_1d = X_pca_1d @ eigenvectors[:, :1].T + X_mean
reconstruction_error_1d = np.mean((X - X_reconstructed_1d)**2)
print(f"Reconstruction error (1 component): {reconstruction_error_1d:.6f}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Original data
axes[0, 0].scatter(X[:, 0], X[:, 1], alpha=0.6)
axes[0, 0].set_title('Original Data')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].axis('equal')
# Centered data with principal components
axes[0, 1].scatter(X_centered[:, 0], X_centered[:, 1], alpha=0.6)
for i, (val, vec) in enumerate(zip(eigenvalues, eigenvectors.T)):
axes[0, 1].arrow(0, 0, vec[0]*np.sqrt(val)*2, vec[1]*np.sqrt(val)*2,
head_width=0.1, head_length=0.1, fc=f'C{i}', ec=f'C{i}',
label=f'PC{i+1}')
axes[0, 1].set_title('Centered Data with PCs')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].axis('equal')
# Data in PC space
axes[1, 0].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.6)
axes[1, 0].set_xlabel('First Principal Component')
axes[1, 0].set_ylabel('Second Principal Component')
axes[1, 0].set_title('Data in Principal Component Space')
axes[1, 0].grid(True, alpha=0.3)
# Reconstruction comparison
axes[1, 1].scatter(X[:, 0], X[:, 1], alpha=0.6, label='Original')
axes[1, 1].scatter(X_reconstructed_1d[:, 0], X_reconstructed_1d[:, 1],
alpha=0.6, label='Reconstructed (1 PC)')
axes[1, 1].set_title('Original vs Reconstructed (1 PC)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].axis('equal')
plt.tight_layout()
plt.show()
return X, X_pca, eigenvectors, eigenvalues
X, X_pca, eigenvectors, eigenvalues = step_by_step_pca_math()
class PCAFromScratch:
"""Principal Component Analysis implementation from scratch"""
def __init__(self, n_components=None):
self.n_components = n_components
self.components_ = None
self.explained_variance_ = None
self.explained_variance_ratio_ = None
self.mean_ = None
def fit(self, X):
"""Fit PCA to data"""
# Center the data
self.mean_ = np.mean(X, axis=0)
X_centered = X - self.mean_
# Compute covariance matrix
n_samples = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort by eigenvalues (descending)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Store results
if self.n_components is None:
self.n_components = len(eigenvalues)
self.components_ = eigenvectors[:, :self.n_components].T
self.explained_variance_ = eigenvalues[:self.n_components]
# Calculate explained variance ratio
total_variance = np.sum(eigenvalues)
self.explained_variance_ratio_ = self.explained_variance_ / total_variance
return self
def transform(self, X):
"""Transform data to principal component space"""
X_centered = X - self.mean_
return X_centered @ self.components_.T
def fit_transform(self, X):
"""Fit and transform in one step"""
return self.fit(X).transform(X)
def inverse_transform(self, X_transformed):
"""Reconstruct original data from transformed data"""
return X_transformed @ self.components_ + self.mean_
def get_covariance(self):
"""Get the covariance matrix of the original data"""
return (self.components_.T * self.explained_variance_) @ self.components_
# Test custom PCA implementation
def test_custom_pca():
"""Test our custom PCA implementation"""
# Generate test data
np.random.seed(42)
X_test = np.random.multivariate_normal([1, 2], [[2, 1.5], [1.5, 1]], 200)
# Apply custom PCA
pca_custom = PCAFromScratch(n_components=2)
X_transformed_custom = pca_custom.fit_transform(X_test)
# Apply scikit-learn PCA for comparison
from sklearn.decomposition import PCA
pca_sklearn = PCA(n_components=2)
X_transformed_sklearn = pca_sklearn.fit_transform(X_test)
print("Custom PCA vs Scikit-learn PCA Comparison:")
print("="*50)
print(f"Explained variance ratio (Custom): {pca_custom.explained_variance_ratio_}")
print(f"Explained variance ratio (Sklearn): {pca_sklearn.explained_variance_ratio_}")
print(f"Components shape (Custom): {pca_custom.components_.shape}")
print(f"Components shape (Sklearn): {pca_sklearn.components_.shape}")
# Check if components are the same (allowing for sign flip)
components_match = np.allclose(np.abs(pca_custom.components_),
np.abs(pca_sklearn.components_), atol=1e-10)
print(f"Components match: {components_match}")
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].scatter(X_test[:, 0], X_test[:, 1], alpha=0.6)
axes[0].set_title('Original Data')
axes[0].grid(True, alpha=0.3)
axes[1].scatter(X_transformed_custom[:, 0], X_transformed_custom[:, 1], alpha=0.6)
axes[1].set_title('Custom PCA')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].grid(True, alpha=0.3)
axes[2].scatter(X_transformed_sklearn[:, 0], X_transformed_sklearn[:, 1], alpha=0.6)
axes[2].set_title('Scikit-learn PCA')
axes[2].set_xlabel('PC1')
axes[2].set_ylabel('PC2')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return pca_custom, pca_sklearn
pca_custom, pca_sklearn = test_custom_pca()
def efficient_pca_methods():
"""Compare different PCA computation methods for efficiency"""
from sklearn.decomposition import PCA, IncrementalPCA, TruncatedSVD
import time
# Generate large dataset
np.random.seed(42)
n_samples, n_features = 5000, 1000
X_large = np.random.normal(0, 1, (n_samples, n_features))
methods = {
'Standard PCA': PCA(n_components=50, random_state=42),
'Incremental PCA': IncrementalPCA(n_components=50, batch_size=500),
'Truncated SVD': TruncatedSVD(n_components=50, random_state=42)
}
results = {}
print("Efficiency Comparison for Large Dataset:")
print(f"Dataset shape: {X_large.shape}")
print("="*50)
for method_name, method in methods.items():
print(f"\nTesting {method_name}...")
# Measure fitting time
start_time = time.time()
X_transformed = method.fit_transform(X_large)
fit_time = time.time() - start_time
# Measure memory usage (approximate)
memory_usage = X_transformed.nbytes / (1024**2) # MB
results[method_name] = {
'fit_time': fit_time,
'memory_usage': memory_usage,
'explained_variance': getattr(method, 'explained_variance_ratio_', None)
}
print(f" Fit time: {fit_time:.3f} seconds")
print(f" Memory usage: {memory_usage:.2f} MB")
if hasattr(method, 'explained_variance_ratio_'):
print(f" Total variance explained: {np.sum(method.explained_variance_ratio_):.3f}")
# Visualize comparison
methods_list = list(results.keys())
fit_times = [results[m]['fit_time'] for m in methods_list]
memory_usage = [results[m]['memory_usage'] for m in methods_list]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].bar(methods_list, fit_times, color=['skyblue', 'lightgreen', 'lightcoral'])
axes[0].set_ylabel('Fit Time (seconds)')
axes[0].set_title('Computation Time Comparison')
axes[0].tick_params(axis='x', rotation=45)
axes[1].bar(methods_list, memory_usage, color=['skyblue', 'lightgreen', 'lightcoral'])
axes[1].set_ylabel('Memory Usage (MB)')
axes[1].set_title('Memory Usage Comparison')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
return results
efficiency_results = efficient_pca_methods()
def analyze_explained_variance(X, max_components=None):
"""Analyze explained variance to select optimal number of components"""
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Determine maximum components
n_samples, n_features = X_scaled.shape
if max_components is None:
max_components = min(n_samples, n_features)
# Fit PCA with all components
pca_full = PCA(n_components=max_components)
pca_full.fit(X_scaled)
# Calculate cumulative explained variance
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
# Find components needed for different variance thresholds
thresholds = [0.80, 0.90, 0.95, 0.99]
threshold_components = []
for threshold in thresholds:
n_comp = np.argmax(cumulative_variance >= threshold) + 1
threshold_components.append(n_comp)
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Individual explained variance
axes[0, 0].bar(range(1, min(21, len(explained_variance_ratio)+1)),
explained_variance_ratio[:20], alpha=0.7)
axes[0, 0].set_xlabel('Principal Component')
axes[0, 0].set_ylabel('Explained Variance Ratio')
axes[0, 0].set_title('Individual Component Variance (First 20)')
axes[0, 0].grid(True, alpha=0.3)
# Cumulative explained variance
components_range = range(1, len(cumulative_variance) + 1)
axes[0, 1].plot(components_range, cumulative_variance, 'b-o', markersize=3)
# Add threshold lines
colors = ['red', 'orange', 'green', 'purple']
for threshold, n_comp, color in zip(thresholds, threshold_components, colors):
axes[0, 1].axhline(y=threshold, color=color, linestyle='--', alpha=0.7,
label=f'{threshold:.0%} ({n_comp} comp)')
axes[0, 1].axvline(x=n_comp, color=color, linestyle='--', alpha=0.7)
axes[0, 1].set_xlabel('Number of Components')
axes[0, 1].set_ylabel('Cumulative Explained Variance')
axes[0, 1].set_title('Cumulative Explained Variance')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Scree plot (eigenvalues)
eigenvalues = pca_full.explained_variance_
axes[1, 0].plot(range(1, min(21, len(eigenvalues)+1)), eigenvalues[:20], 'ro-')
axes[1, 0].set_xlabel('Principal Component')
axes[1, 0].set_ylabel('Eigenvalue')
axes[1, 0].set_title('Scree Plot (First 20 Components)')
axes[1, 0].grid(True, alpha=0.3)
# Elbow detection
if len(eigenvalues) > 3:
# Calculate second derivative to find elbow
second_derivative = np.gradient(np.gradient(eigenvalues[:20]))
elbow_idx = np.argmax(second_derivative) + 1 if any(second_derivative > 0) else len(second_derivative)
axes[1, 0].axvline(x=elbow_idx, color='green', linestyle='--',
label=f'Elbow at {elbow_idx}')
axes[1, 0].legend()
# Component selection summary
axes[1, 1].axis('off')
summary_text = "Component Selection Summary:\n\n"
for threshold, n_comp in zip(thresholds, threshold_components):
summary_text += f"{threshold:.0%} variance: {n_comp} components\n"
if len(eigenvalues) > 3:
summary_text += f"\nElbow method suggests: {elbow_idx} components\n"
# Add practical recommendations
summary_text += "\nRecommendations:\n"
summary_text += f"✅ For visualization: 2-3 components\n"
summary_text += f"✅ For preprocessing: {threshold_components[1]} components (90%)\n"
summary_text += f"✅ For high accuracy: {threshold_components[2]} components (95%)\n"
axes[1, 1].text(0.1, 0.9, summary_text, transform=axes[1, 1].transAxes,
fontsize=11, verticalalignment='top',
bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
plt.tight_layout()
plt.show()
# Return analysis results
analysis_results = {
'explained_variance_ratio': explained_variance_ratio,
'cumulative_variance': cumulative_variance,
'threshold_components': dict(zip(thresholds, threshold_components)),
'eigenvalues': eigenvalues
}
if len(eigenvalues) > 3:
analysis_results['elbow_point'] = elbow_idx
return analysis_results
# Example usage with different datasets
datasets = {
'Random Data': np.random.normal(0, 1, (500, 50)),
'Correlated Data': None # Will generate correlated data
}
# Generate correlated data
np.random.seed(42)
base_data = np.random.normal(0, 1, (500, 10))
noise = np.random.normal(0, 0.1, (500, 40))
correlated_data = np.column_stack([
base_data,
base_data[:, :5] + noise[:, :5], # Correlated features
base_data[:, :10] * 0.5 + noise[:, 5:15], # Partially correlated
noise[:, 15:] # Pure noise
])
datasets['Correlated Data'] = correlated_data
for name, X in datasets.items():
print(f"\n{'='*60}")
print(f"Analysis for {name}")
print(f"{'='*60}")
if X is not None:
variance_analysis = analyze_explained_variance(X, max_components=30)
def pca_cross_validation_selection(X, y, max_components=20):
"""Select optimal number of PCA components using cross-validation"""
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Test different numbers of components
n_components_range = range(1, min(max_components + 1, X_scaled.shape[1]))
cv_scores = []
cv_stds = []
print("Cross-Validation Component Selection:")
print("="*40)
for n_comp in n_components_range:
# Create pipeline
pipeline = Pipeline([
('pca', PCA(n_components=n_comp, random_state=42)),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
# Cross-validation
scores = cross_val_score(pipeline, X_scaled, y, cv=5, scoring='accuracy')
cv_scores.append(scores.mean())
cv_stds.append(scores.std())
if n_comp <= 10 or n_comp % 5 == 0: # Print selected results
print(f" {n_comp:2d} components: {scores.mean():.4f} ± {scores.std():.4f}")
# Find optimal number of components
optimal_idx = np.argmax(cv_scores)
optimal_components = n_components_range[optimal_idx]
optimal_score = cv_scores[optimal_idx]
print(f"\nOptimal components: {optimal_components}")
print(f"Best CV score: {optimal_score:.4f} ± {cv_stds[optimal_idx]:.4f}")
# Visualization
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.errorbar(n_components_range, cv_scores, yerr=cv_stds,
marker='o', capsize=5, capthick=2)
plt.axvline(x=optimal_components, color='red', linestyle='--',
label=f'Optimal: {optimal_components}')
plt.xlabel('Number of Components')
plt.ylabel('Cross-Validation Accuracy')
plt.title('PCA Component Selection via Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
# Compare with baseline (no PCA)
baseline_pipeline = Pipeline([
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
baseline_scores = cross_val_score(baseline_pipeline, X_scaled, y, cv=5)
baseline_mean = baseline_scores.mean()
plt.subplot(1, 2, 2)
performance_comparison = [baseline_mean, optimal_score]
labels = ['No PCA\n(All Features)', f'PCA\n({optimal_components} Components)']
colors = ['lightcoral', 'lightgreen']
bars = plt.bar(labels, performance_comparison, color=colors)
plt.ylabel('Cross-Validation Accuracy')
plt.title('Performance Comparison')
# Add value labels on bars
for bar, value in zip(bars, performance_comparison):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{value:.4f}', ha='center', va='bottom')
plt.ylim(min(performance_comparison) - 0.05, max(performance_comparison) + 0.05)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return {
'n_components_range': list(n_components_range),
'cv_scores': cv_scores,
'cv_stds': cv_stds,
'optimal_components': optimal_components,
'optimal_score': optimal_score,
'baseline_score': baseline_mean
}
# Example usage
X_example, y_example = make_classification(n_samples=1000, n_features=50,
n_informative=15, n_redundant=10,
random_state=42)
cv_results = pca_cross_validation_selection(X_example, y_example, max_components=30)
def pca_visualization_techniques():
"""Demonstrate PCA for data visualization"""
from sklearn.datasets import load_digits, load_wine, load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load different datasets
datasets = {
'Digits': load_digits(),
'Wine': load_wine(),
'Breast Cancer': load_breast_cancer()
}
fig, axes = plt.subplots(len(datasets), 3, figsize=(15, 5 * len(datasets)))
for i, (name, dataset) in enumerate(datasets.items()):
X, y = dataset.data, dataset.target
print(f"\n{name} Dataset:")
print(f" Original shape: {X.shape}")
print(f" Number of classes: {len(np.unique(y))}")
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_scaled)
pca_3d = PCA(n_components=3, random_state=42)
X_pca_3d = pca_3d.fit_transform(X_scaled)
# 2D visualization
scatter = axes[i, 0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[i, 0].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})')
axes[i, 0].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})')
axes[i, 0].set_title(f'{name} - 2D PCA')
axes[i, 0].grid(True, alpha=0.3)
# Explained variance plot
pca_full = PCA().fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
axes[i, 1].plot(range(1, len(cumvar[:20]) + 1), cumvar[:20], 'bo-')
axes[i, 1].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[i, 1].set_xlabel('Number of Components')
axes[i, 1].set_ylabel('Cumulative Explained Variance')
axes[i, 1].set_title(f'{name} - Explained Variance')
axes[i, 1].legend()
axes[i, 1].grid(True, alpha=0.3)
# Component interpretation (feature importance)
n_features_show = min(10, X.shape[1])
component_importance = np.abs(pca_2d.components_[0, :n_features_show])
feature_names = getattr(dataset, 'feature_names',
[f'Feature_{j}' for j in range(X.shape[1])])
axes[i, 2].barh(range(n_features_show), component_importance)
axes[i, 2].set_yticks(range(n_features_show))
axes[i, 2].set_yticklabels([feature_names[j][:15] for j in range(n_features_show)])
axes[i, 2].set_xlabel('Absolute Component Weight')
axes[i, 2].set_title(f'{name} - PC1 Feature Importance')
axes[i, 2].grid(True, alpha=0.3)
print(f" 2D PCA variance explained: {np.sum(pca_2d.explained_variance_ratio_):.1%}")
print(f" 3D PCA variance explained: {np.sum(pca_3d.explained_variance_ratio_):.1%}")
plt.tight_layout()
plt.show()
pca_visualization_techniques()
def pca_noise_reduction_demo():
"""Demonstrate PCA for noise reduction and data compression"""
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load digit data
digits = load_digits()
X_digits = digits.data # 1,797 digits, each 8×8 pixels (64 features)
print("PCA for Noise Reduction and Compression Demo:")
print("="*50)
# Add noise to simulate real-world conditions
np.random.seed(42)
noise = np.random.normal(0, 2, X_digits.shape)
X_noisy = X_digits + noise
# Standardize data
scaler = StandardScaler()
X_noisy_scaled = scaler.fit_transform(X_noisy)
# Test different numbers of components
n_components_list = [5, 10, 20, 30, 40, 64] # At most 64 components for 64 features
# Apply PCA to entire dataset
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_digits)
# Analyze compression results
compression_results = []
fig, axes = plt.subplots(2, len(n_components_list), figsize=(20, 8))
for i, n_comp in enumerate(n_components_list):
# Reconstruct using n components
pca = PCA(n_components=n_comp, random_state=42)
X_pca = pca.fit_transform(X_noisy_scaled)
X_reconstructed = scaler.inverse_transform(pca.inverse_transform(X_pca))
# Calculate metrics
mse = np.mean((X_digits - X_reconstructed) ** 2)
variance_explained = np.sum(pca.explained_variance_ratio_)
compression_ratio = 64 / n_comp # Original features / retained components
compression_results.append({
'n_components': n_comp,
'mse': mse,
'variance_explained': variance_explained,
'compression_ratio': compression_ratio
})
# Display the first digit: original vs reconstructed
original_image = X_digits[0].reshape(8, 8)
reconstructed_image = X_reconstructed[0].reshape(8, 8)
axes[0, i].imshow(original_image, cmap='gray')
axes[0, i].set_title('Original')
axes[0, i].axis('off')
axes[1, i].imshow(reconstructed_image, cmap='gray')
axes[1, i].set_title(f'{n_comp} comp\nMSE: {mse:.4f}\nVar: {variance_explained:.1%}')
axes[1, i].axis('off')
print(f"{n_comp:3d} components: MSE={mse:.4f}, Variance={variance_explained:.1%}, Compression={compression_ratio:.1f}x")
plt.tight_layout()
plt.show()
# Analysis plots
df_compression = pd.DataFrame(compression_results)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# MSE vs Components
axes[0].plot(df_compression['n_components'], df_compression['mse'], 'ro-')
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('Mean Squared Error')
axes[0].set_title('Reconstruction Error vs Components')
axes[0].grid(True, alpha=0.3)
# Variance Explained
axes[1].plot(df_compression['n_components'], df_compression['variance_explained'], 'bo-')
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Variance Explained')
axes[1].set_title('Information Retention vs Components')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Compression Trade-off
axes[2].scatter(df_compression['compression_ratio'], df_compression['mse'],
c=df_compression['n_components'], cmap='viridis', s=100)
axes[2].set_xlabel('Compression Ratio')
axes[2].set_ylabel('Reconstruction Error (MSE)')
axes[2].set_title('Compression vs Quality Trade-off')
# Add component labels
for _, row in df_compression.iterrows():
axes[2].annotate(f"{row['n_components']}",
(row['compression_ratio'], row['mse']),
xytext=(5, 5), textcoords='offset points', fontsize=8)
plt.colorbar(axes[2].collections[0], ax=axes[2], label='Components')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return df_compression
compression_lab_results = pca_noise_reduction_demo()
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that finds the directions that best separate different classes. Unlike PCA, which maximizes variance, LDA maximizes class separability.
LDA seeks to find a projection that maximizes the ratio of between-class variance to within-class variance:
J(w) = (w^T S_B w) / (w^T S_W w)
Where:
- S_B = between-class scatter matrix
- S_W = within-class scatter matrix
- w = projection vector
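To make this objective concrete, here is a minimal sketch on a hypothetical two-class toy dataset. It builds S_B and S_W directly from their definitions and compares J(w) for an arbitrary axis-aligned projection against the optimal one, obtained as the top eigenvector of S_W^(-1) S_B:
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, (100, 2))  # class 0
X1 = rng.normal([3, 1], 1.0, (100, 2))  # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

overall_mean = X.mean(axis=0)
S_B = np.zeros((2, 2))
S_W = np.zeros((2, 2))
for c in (0, 1):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    d = (mc - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * d @ d.T        # between-class scatter
    S_W += (Xc - mc).T @ (Xc - mc)  # within-class scatter

def J(w):
    """Fisher criterion: between-class over within-class variance."""
    return (w @ S_B @ w) / (w @ S_W @ w)

# The optimal w solves S_W^(-1) S_B w = lambda * w; take the top eigenvector
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_opt = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print(f"J(axis-aligned w): {J(np.array([1.0, 0.0])):.3f}")
print(f"J(optimal w): {J(w_opt):.3f}")
The optimal projection should score a noticeably higher Fisher ratio, which is exactly what the comparison code below measures on a larger dataset.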
def lda_vs_pca_comparison():
"""Compare LDA and PCA for dimensionality reduction"""
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# Generate classification dataset with overlapping classes
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, n_clusters_per_class=2,
class_sep=0.8, random_state=42)
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("LDA vs PCA Comparison:")
print("="*30)
print(f"Original data shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
# Apply PCA and LDA
pca = PCA(n_components=2, random_state=42)
lda = LDA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_lda = lda.fit_transform(X_scaled, y)
# Evaluate classification performance
classifier = SVC(random_state=42)
# Original data performance
scores_original = cross_val_score(classifier, X_scaled, y, cv=5)
# PCA performance
scores_pca = cross_val_score(classifier, X_pca, y, cv=5)
# LDA performance
scores_lda = cross_val_score(classifier, X_lda, y, cv=5)
print(f"\nClassification Performance:")
print(f"Original (20D): {scores_original.mean():.4f} ± {scores_original.std():.4f}")
print(f"PCA (2D): {scores_pca.mean():.4f} ± {scores_pca.std():.4f}")
print(f"LDA (2D): {scores_lda.mean():.4f} ± {scores_lda.std():.4f}")
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# PCA visualization
scatter_pca = axes[0, 0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[0, 0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[0, 0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
axes[0, 0].set_title('PCA Projection')
axes[0, 0].grid(True, alpha=0.3)
plt.colorbar(scatter_pca, ax=axes[0, 0])
# LDA visualization
scatter_lda = axes[0, 1].scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='viridis', alpha=0.7)
axes[0, 1].set_xlabel('LD1')
axes[0, 1].set_ylabel('LD2')
axes[0, 1].set_title('LDA Projection')
axes[0, 1].grid(True, alpha=0.3)
plt.colorbar(scatter_lda, ax=axes[0, 1])
# Performance comparison
methods = ['Original\n(20D)', 'PCA\n(2D)', 'LDA\n(2D)']
performances = [scores_original.mean(), scores_pca.mean(), scores_lda.mean()]
errors = [scores_original.std(), scores_pca.std(), scores_lda.std()]
bars = axes[0, 2].bar(methods, performances, yerr=errors, capsize=5,
color=['lightblue', 'lightgreen', 'lightcoral'])
axes[0, 2].set_ylabel('Cross-Validation Accuracy')
axes[0, 2].set_title('Performance Comparison')
axes[0, 2].grid(True, alpha=0.3)
# Add value labels on bars
for bar, value in zip(bars, performances):
axes[0, 2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{value:.3f}', ha='center', va='bottom')
# Component analysis
# PCA components (feature importance)
pca_components = pca.components_
rf_importance = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_pca, y).feature_importances_
# Calculate original feature importance through PCA
original_importance = np.abs(pca_components.T @ rf_importance)
# Show top 15 features
top_features = np.argsort(original_importance)[-15:]
axes[1, 0].barh(range(15), original_importance[top_features])
axes[1, 0].set_yticks(range(15))
axes[1, 0].set_yticklabels([f'Feature {i}' for i in top_features])
axes[1, 0].set_xlabel('Importance Score')
axes[1, 0].set_title('PCA Feature Importance')
axes[1, 0].grid(True, alpha=0.3)
# LDA components (discriminant weights)
lda_importance = np.abs(lda.scalings_[:, 0]) # First discriminant
top_features_lda = np.argsort(lda_importance)[-10:]
axes[1, 1].barh(range(10), lda_importance[top_features_lda])
axes[1, 1].set_yticks(range(10))
axes[1, 1].set_yticklabels([f'Feature {i}' for i in top_features_lda])
axes[1, 1].set_xlabel('Absolute Discriminant Weight')
axes[1, 1].set_title('LDA Feature Importance')
axes[1, 1].grid(True, alpha=0.3)
# Class separability analysis
# Calculate Fisher's ratio for both methods
def fishers_ratio(X_proj, y):
"""Calculate Fisher's ratio (between-class / within-class variance)"""
classes = np.unique(y)
class_means = [X_proj[y == c].mean(axis=0) for c in classes]
overall_mean = X_proj.mean(axis=0)
# Between-class variance
between_var = sum(len(X_proj[y == c]) * np.sum((class_means[i] - overall_mean)**2)
for i, c in enumerate(classes))
# Within-class variance
within_var = sum(np.sum((X_proj[y == c] - class_means[i])**2)
for i, c in enumerate(classes))
return between_var / within_var if within_var > 0 else 0
fisher_pca = fishers_ratio(X_pca, y)
fisher_lda = fishers_ratio(X_lda, y)
fisher_ratios = [fisher_pca, fisher_lda]
method_names = ['PCA', 'LDA']
bars = axes[1, 2].bar(method_names, fisher_ratios, color=['lightgreen', 'lightcoral'])
axes[1, 2].set_ylabel("Fisher's Ratio")
axes[1, 2].set_title('Class Separability Comparison')
axes[1, 2].grid(True, alpha=0.3)
# Add value labels
for bar, ratio in zip(bars, fisher_ratios):
axes[1, 2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{ratio:.2f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
return {
'pca_performance': scores_pca.mean(),
'lda_performance': scores_lda.mean(),
'fisher_pca': fisher_pca,
'fisher_lda': fisher_lda
}
lda_pca_results = lda_vs_pca_comparison()
class LDAFromScratch:
"""Linear Discriminant Analysis implementation from scratch"""
def __init__(self, n_components=None):
self.n_components = n_components
self.scalings_ = None
self.means_ = None
self.classes_ = None
def fit(self, X, y):
"""Fit LDA to training data"""
self.classes_ = np.unique(y)
n_classes = len(self.classes_)
n_features = X.shape[1]
# If n_components not specified, use maximum possible
if self.n_components is None:
self.n_components = min(n_classes - 1, n_features)
# Calculate class means
class_means = []
for c in self.classes_:
class_means.append(X[y == c].mean(axis=0))
class_means = np.array(class_means)
# Overall mean
overall_mean = X.mean(axis=0)
# Between-class scatter matrix (S_B)
S_B = np.zeros((n_features, n_features))
for i, c in enumerate(self.classes_):
n_c = np.sum(y == c)
mean_diff = (class_means[i] - overall_mean).reshape(-1, 1)
S_B += n_c * (mean_diff @ mean_diff.T)
# Within-class scatter matrix (S_W)
S_W = np.zeros((n_features, n_features))
for c in self.classes_:
class_data = X[y == c]
class_mean = class_data.mean(axis=0)
for sample in class_data:
diff = (sample - class_mean).reshape(-1, 1)
S_W += diff @ diff.T
# Solve generalized eigenvalue problem: S_B * v = λ * S_W * v
# This is equivalent to: S_W^(-1) * S_B * v = λ * v
try:
# Add small regularization to avoid singular matrix
S_W_reg = S_W + np.eye(n_features) * 1e-6
eigenvals, eigenvecs = np.linalg.eig(np.linalg.inv(S_W_reg) @ S_B)
# The non-symmetric eigenproblem can return tiny imaginary parts; drop them
eigenvals, eigenvecs = np.real(eigenvals), np.real(eigenvecs)
# Sort by eigenvalues (descending)
idx = eigenvals.argsort()[::-1]
eigenvals = eigenvals[idx]
eigenvecs = eigenvecs[:, idx]
# Select top components
self.scalings_ = eigenvecs[:, :self.n_components]
self.eigenvalues_ = eigenvals[:self.n_components]
except np.linalg.LinAlgError:
# Fallback to SVD if matrix is singular
U, s, Vt = np.linalg.svd(S_B)
self.scalings_ = U[:, :self.n_components]
self.eigenvalues_ = s[:self.n_components]
self.means_ = class_means
return self
def transform(self, X):
"""Transform data to discriminant space"""
return X @ self.scalings_
def fit_transform(self, X, y):
"""Fit and transform in one step"""
return self.fit(X, y).transform(X)
# Test custom LDA implementation
def test_custom_lda():
"""Test our custom LDA implementation"""
# Generate test data
np.random.seed(42)
X_test, y_test = make_classification(n_samples=300, n_features=10,
n_classes=3, n_informative=8,
random_state=42)
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_test)
# Apply custom LDA
lda_custom = LDAFromScratch(n_components=2)
X_custom = lda_custom.fit_transform(X_scaled, y_test)
# Apply sklearn LDA
lda_sklearn = LDA(n_components=2)
X_sklearn = lda_sklearn.fit_transform(X_scaled, y_test)
print("Custom LDA vs Scikit-learn LDA:")
print("="*35)
print(f"Custom LDA shape: {X_custom.shape}")
print(f"Sklearn LDA shape: {X_sklearn.shape}")
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Original data (first 2 features)
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_test, cmap='viridis', alpha=0.7)
axes[0].set_title('Original Data (First 2 Features)')
axes[0].grid(True, alpha=0.3)
axes[1].scatter(X_custom[:, 0], X_custom[:, 1], c=y_test, cmap='viridis', alpha=0.7)
axes[1].set_title('Custom LDA')
axes[1].set_xlabel('LD1')
axes[1].set_ylabel('LD2')
axes[1].grid(True, alpha=0.3)
axes[2].scatter(X_sklearn[:, 0], X_sklearn[:, 1], c=y_test, cmap='viridis', alpha=0.7)
axes[2].set_title('Scikit-learn LDA')
axes[2].set_xlabel('LD1')
axes[2].set_ylabel('LD2')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return lda_custom, lda_sklearn
custom_lda, sklearn_lda = test_custom_lda()
t-SNE is a non-linear dimensionality reduction technique primarily used for data visualization. It preserves local neighborhood structure and can reveal clusters that linear methods might miss.
1. Probability Distributions: t-SNE converts pairwise distances into probabilities
2. Kullback-Leibler Divergence: Minimizes the difference between the high-dimensional and low-dimensional distributions
3. Perplexity: Controls the effective number of neighbors (see the sketch below)
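A minimal sketch of the first ingredient, assuming a fixed Gaussian bandwidth sigma (real t-SNE instead searches for a per-point sigma_i that hits the target perplexity): distances from one point are converted to a conditional distribution p(j|i), and the perplexity 2^H(P_i) reads off its effective number of neighbors.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(0, 1, (50, 5))  # synthetic points, illustration only
i, sigma = 0, 1.0

# Squared distances from point i to every point
d2 = np.sum((X - X[i]) ** 2, axis=1)
p = np.exp(-d2 / (2 * sigma ** 2))
p[i] = 0.0     # a point is not its own neighbor
p /= p.sum()   # conditional distribution p(j|i)

# Perplexity = 2^H(P_i): the effective number of neighbors of point i
H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
print(f"Perplexity of point {i}: {2 ** H:.2f}")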
def tsne_comprehensive_analysis():
"""Comprehensive t-SNE analysis with parameter exploration"""
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits, make_swiss_roll
from sklearn.preprocessing import StandardScaler
# Load datasets
datasets = {
'Digits': load_digits(),
'Swiss Roll': make_swiss_roll(n_samples=1000, random_state=42)
}
# t-SNE parameters to test
perplexity_values = [5, 15, 30, 50]
learning_rates = [10, 200, 1000]
for dataset_name, dataset in datasets.items():
if dataset_name == 'Swiss Roll':
X, color = dataset
y = color # Use color for visualization
else:
X, y = dataset.data, dataset.target
print(f"\n{'='*50}")
print(f"t-SNE Analysis: {dataset_name}")
print(f"{'='*50}")
print(f"Data shape: {X.shape}")
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Test different perplexity values
fig, axes = plt.subplots(2, len(perplexity_values), figsize=(20, 10))
for i, perplexity in enumerate(perplexity_values):
print(f"Processing perplexity = {perplexity}...")
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42,
learning_rate=200, n_iter=1000)
X_tsne = tsne.fit_transform(X_scaled)
# Plot result
scatter = axes[0, i].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y,
cmap='tab10' if dataset_name == 'Digits' else 'viridis',
alpha=0.7, s=20)
axes[0, i].set_title(f'Perplexity = {perplexity}')
axes[0, i].grid(True, alpha=0.3)
# Calculate and plot KL divergence over iterations (if available)
# Note: sklearn doesn't expose KL divergence history, so we'll show final result
axes[1, i].text(0.5, 0.5, f'Perplexity: {perplexity}\nFinal KL Divergence: {tsne.kl_divergence_:.2f}',
ha='center', va='center', transform=axes[1, i].transAxes,
bbox=dict(boxstyle='round', facecolor='lightgray'))
axes[1, i].set_title('Optimization Info')
plt.suptitle(f't-SNE Perplexity Comparison - {dataset_name}', fontsize=16)
plt.tight_layout()
plt.show()
# Compare with PCA
print("\nComparing t-SNE with PCA...")
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
# Best t-SNE (perplexity=30 is usually good default)
tsne_best = TSNE(n_components=2, perplexity=30, random_state=42, learning_rate=200)
X_tsne_best = tsne_best.fit_transform(X_scaled)
# Visualization comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# Original data (first 2 features)
if X.shape[1] >= 2:
axes[0].scatter(X[:, 0], X[:, 1], c=y,
cmap='tab10' if dataset_name == 'Digits' else 'viridis',
alpha=0.7)
axes[0].set_title('Original Data (First 2 Features)')
feature_names = getattr(dataset, 'feature_names', ['Feature 0', 'Feature 1'])
axes[0].set_xlabel(feature_names[0][:20])
axes[0].set_ylabel(feature_names[1][:20])
else:
axes[0].text(0.5, 0.5, 'Not applicable', ha='center', va='center',
transform=axes[0].transAxes)
axes[0].set_title('Original Data')
# PCA
scatter = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[1].set_title(f'PCA ({pca.explained_variance_ratio_.sum():.1%} var)')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
# t-SNE
scatter = axes[2].scatter(X_tsne_best[:, 0], X_tsne_best[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[2].set_title('t-SNE (Perplexity=30)')
axes[2].set_xlabel('t-SNE 1')
axes[2].set_ylabel('t-SNE 2')
for ax in axes:
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
tsne_comprehensive_analysis()
def tsne_parameter_optimization():
"""Optimize t-SNE parameters for best visualization"""
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
# Load data
digits = load_digits()
X, y = digits.data, digits.target
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Parameter grid
perplexity_range = [5, 10, 15, 20, 30, 40, 50]
learning_rate_range = [10, 50, 100, 200, 500, 1000]
print("t-SNE Parameter Optimization:")
print("="*30)
# Store results
results = []
# Test subset of combinations for efficiency
param_combinations = [
(5, 200), (15, 200), (30, 200), (50, 200), # Different perplexities
(30, 10), (30, 100), (30, 500), (30, 1000) # Different learning rates
]
for perplexity, learning_rate in param_combinations:
print(f"Testing perplexity={perplexity}, learning_rate={learning_rate}")
try:
# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=perplexity,
learning_rate=learning_rate, random_state=42,
n_iter=1000, verbose=0)
X_tsne = tsne.fit_transform(X_scaled)
# Evaluate clustering quality using silhouette score
# Apply KMeans to t-SNE result
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_tsne)
# Calculate silhouette score
sil_score = silhouette_score(X_tsne, cluster_labels)
# Calculate how well clusters match true labels (ARI)
from sklearn.metrics import adjusted_rand_score
ari_score = adjusted_rand_score(y, cluster_labels)
results.append({
'perplexity': perplexity,
'learning_rate': learning_rate,
'silhouette_score': sil_score,
'ari_score': ari_score,
'kl_divergence': tsne.kl_divergence_,
'X_tsne': X_tsne
})
print(f" Silhouette: {sil_score:.3f}, ARI: {ari_score:.3f}, KL: {tsne.kl_divergence_:.2f}")
except Exception as e:
print(f" Error: {e}")
continue
# Find best parameters
best_result = max(results, key=lambda x: x['silhouette_score'])
print(f"\nBest parameters:")
print(f" Perplexity: {best_result['perplexity']}")
print(f" Learning Rate: {best_result['learning_rate']}")
print(f" Silhouette Score: {best_result['silhouette_score']:.3f}")
print(f" ARI Score: {best_result['ari_score']:.3f}")
# Visualize parameter effects
df_results = pd.DataFrame(results)
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Perplexity effect (fixed learning rate = 200)
perp_results = df_results[df_results['learning_rate'] == 200]
if not perp_results.empty:
axes[0, 0].plot(perp_results['perplexity'], perp_results['silhouette_score'], 'bo-')
axes[0, 0].set_xlabel('Perplexity')
axes[0, 0].set_ylabel('Silhouette Score')
axes[0, 0].set_title('Perplexity Effect (LR=200)')
axes[0, 0].grid(True, alpha=0.3)
# Learning rate effect (fixed perplexity = 30)
lr_results = df_results[df_results['perplexity'] == 30]
if not lr_results.empty:
axes[0, 1].plot(lr_results['learning_rate'], lr_results['silhouette_score'], 'ro-')
axes[0, 1].set_xlabel('Learning Rate')
axes[0, 1].set_ylabel('Silhouette Score')
axes[0, 1].set_title('Learning Rate Effect (Perp=30)')
axes[0, 1].set_xscale('log')
axes[0, 1].grid(True, alpha=0.3)
# KL divergence vs quality
axes[0, 2].scatter(df_results['kl_divergence'], df_results['silhouette_score'],
c=df_results['perplexity'], cmap='viridis', s=60)
axes[0, 2].set_xlabel('KL Divergence')
axes[0, 2].set_ylabel('Silhouette Score')
axes[0, 2].set_title('Convergence vs Quality')
plt.colorbar(axes[0, 2].collections[0], ax=axes[0, 2], label='Perplexity')
axes[0, 2].grid(True, alpha=0.3)
# Show best visualizations
# Best result
axes[1, 0].scatter(best_result['X_tsne'][:, 0], best_result['X_tsne'][:, 1],
c=y, cmap='tab10', alpha=0.7, s=20)
axes[1, 0].set_title(f'Best Result\n(Perp={best_result["perplexity"]}, LR={best_result["learning_rate"]})')
# Comparison with different parameters
worst_result = min(results, key=lambda x: x['silhouette_score'])
axes[1, 1].scatter(worst_result['X_tsne'][:, 0], worst_result['X_tsne'][:, 1],
c=y, cmap='tab10', alpha=0.7, s=20)
axes[1, 1].set_title(f'Worst Result\n(Perp={worst_result["perplexity"]}, LR={worst_result["learning_rate"]})')
# Parameter space heatmap
pivot_data = df_results.pivot_table(values='silhouette_score',
index='perplexity',
columns='learning_rate',
fill_value=np.nan)
im = axes[1, 2].imshow(pivot_data.values, cmap='viridis', aspect='auto')
axes[1, 2].set_xticks(range(len(pivot_data.columns)))
axes[1, 2].set_xticklabels(pivot_data.columns)
axes[1, 2].set_yticks(range(len(pivot_data.index)))
axes[1, 2].set_yticklabels(pivot_data.index)
axes[1, 2].set_xlabel('Learning Rate')
axes[1, 2].set_ylabel('Perplexity')
axes[1, 2].set_title('Parameter Heatmap')
plt.colorbar(im, ax=axes[1, 2], label='Silhouette Score')
plt.tight_layout()
plt.show()
return results, best_result
tsne_results, best_tsne = tsne_parameter_optimization()
def feature_selection_vs_extraction():
"""Compare feature selection and feature extraction methods"""
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
# Generate dataset with mix of informative and noisy features
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=50,
n_informative=15, n_redundant=10,
n_clusters_per_class=1, random_state=42)
print("Feature Selection vs Feature Extraction Comparison:")
print("="*55)
print(f"Original dataset shape: {X.shape}")
# Hold out a test set; each pipeline below standardizes internally
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=42, stratify=y)
# Define methods
methods = {
'Original': None,
'Filter Selection (f_classif)': 'f_classif',
'Wrapper Selection (RFE)': 'rfe',
'PCA (15 components)': 'pca_15',
'LDA (1 component)': 'lda_1'
}
results = {}
for method_name, method_type in methods.items():
print(f"\nTesting {method_name}...")
if method_type is None:
# No reduction
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'f_classif':
# Filter selection
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(f_classif, k=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'rfe':
# Wrapper selection
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', RFE(LogisticRegression(random_state=42, max_iter=1000),
n_features_to_select=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'pca_15':
# PCA with 15 components
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=15, random_state=42)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'lda_1':
# LDA with 1 component (binary data allows at most n_classes - 1 = 1)
pipeline = Pipeline([
('scaler', StandardScaler()),
('lda', LDA(n_components=1)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Get effective dimensionality
if method_type is None:
n_features_used = X.shape[1]
elif 'pca' in pipeline.named_steps:
n_features_used = pipeline.named_steps['pca'].n_components_
elif 'lda' in pipeline.named_steps:
n_features_used = pipeline.named_steps['lda'].scalings_.shape[1]
elif 'selector' in pipeline.named_steps:
selector = pipeline.named_steps['selector']
# SelectKBest exposes k; a fitted RFE exposes n_features_
n_features_used = getattr(selector, 'k', getattr(selector, 'n_features_', X.shape[1]))
else:
n_features_used = X.shape[1]
results[method_name] = {
'accuracy': accuracy,
'n_features': n_features_used,
'pipeline': pipeline,
'predictions': y_pred
}
print(f" Accuracy: {accuracy:.4f}")
print(f" Features used: {n_features_used}")
print(f" Reduction: {(1 - n_features_used/X.shape[1])*100:.1f}%")
# Detailed analysis
print(f"\nDetailed Analysis")
print("-" * 17)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Performance comparison
method_names = list(results.keys())
accuracies = [results[m]['accuracy'] for m in method_names]
n_features = [results[m]['n_features'] for m in method_names]
bars = axes[0, 0].bar(range(len(method_names)), accuracies, alpha=0.7)
axes[0, 0].set_xticks(range(len(method_names)))
axes[0, 0].set_xticklabels([name.split()[0] for name in method_names], rotation=45)
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Method Performance Comparison')
axes[0, 0].grid(True, alpha=0.3)
# Add value labels
for bar, acc in zip(bars, accuracies):
axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{acc:.3f}', ha='center', va='bottom', fontsize=9)
# Feature count comparison
bars = axes[0, 1].bar(range(len(method_names)), n_features, alpha=0.7)
axes[0, 1].set_xticks(range(len(method_names)))
axes[0, 1].set_xticklabels([name.split()[0] for name in method_names], rotation=45)
axes[0, 1].set_ylabel('Number of Features')
axes[0, 1].set_title('Dimensionality Comparison')
axes[0, 1].grid(True, alpha=0.3)
# PCA analysis (if available)
pca_pipeline = results['PCA (15 components)']['pipeline']
if 'pca' in pca_pipeline.named_steps:
pca = pca_pipeline.named_steps['pca']
# Explained variance
axes[0, 2].plot(range(1, len(pca.explained_variance_ratio_) + 1),
np.cumsum(pca.explained_variance_ratio_), 'bo-')
axes[0, 2].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[0, 2].set_xlabel('Component')
axes[0, 2].set_ylabel('Cumulative Variance Explained')
axes[0, 2].set_title('PCA Variance Analysis')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)
# Feature importance in first 2 PCs
pc1_importance = np.abs(pca.components_[0])
pc2_importance = np.abs(pca.components_[1])
top_features_pc1 = np.argsort(pc1_importance)[-10:]
top_features_pc2 = np.argsort(pc2_importance)[-10:]
axes[1, 0].barh(range(10), pc1_importance[top_features_pc1])
axes[1, 0].set_yticks(range(10))
axes[1, 0].set_yticklabels([feature_names[i][:15] for i in top_features_pc1], fontsize=8)
axes[1, 0].set_xlabel('Importance Score')
axes[1, 0].set_title('PC1 - Top Feature Contributions')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 1].barh(range(10), pc2_importance[top_features_pc2])
axes[1, 1].set_yticks(range(10))
axes[1, 1].set_yticklabels([feature_names[i][:15] for i in top_features_pc2], fontsize=8)
axes[1, 1].set_xlabel('Importance Score')
axes[1, 1].set_title('PC2 - Top Feature Contributions')
axes[1, 1].grid(True, alpha=0.3)
# Accuracy vs Dimensionality trade-off
axes[1, 2].scatter(n_features, accuracies, s=100, alpha=0.7)
for i, method in enumerate(method_names):
axes[1, 2].annotate(method.split()[0], (n_features[i], accuracies[i]),
xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 2].set_xlabel('Number of Features')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Accuracy vs Dimensionality Trade-off')
axes[1, 2].grid(True, alpha=0.3)
# Method recommendations (all six panels are in use, so print to console
# instead of overwriting the trade-off plot on axes[1, 2])
best_method = max(results.keys(), key=lambda k: results[k]['accuracy'])
most_efficient = min(results.keys(), key=lambda k: results[k]['n_features']
if results[k]['accuracy'] > 0.9 else float('inf'))
recommendations = f"""
RECOMMENDATIONS:
Best Accuracy: {best_method}
✅ Accuracy: {results[best_method]['accuracy']:.4f}
✅ Features: {results[best_method]['n_features']}
Most Efficient: {most_efficient}
✅ Accuracy: {results[most_efficient]['accuracy']:.4f}
✅ Features: {results[most_efficient]['n_features']}
✅ Reduction: {(1-results[most_efficient]['n_features']/X.shape[1])*100:.1f}%
INSIGHTS:
✅ {len(high_corr_pairs)} highly correlated feature pairs
✅ PCA effective for noise reduction
✅ LDA good for classification tasks
✅ Feature selection preserves interpretability
"""
print(recommendations)
plt.tight_layout()
plt.show()
# Detailed classification reports
print(f"\n{'='*60}")
print("DETAILED CLASSIFICATION REPORTS:")
print(f"{'='*60}")
for method_name, result in results.items():
print(f"\n{method_name}:")
print("-" * len(method_name))
print(classification_report(y_test, result['predictions'], target_names=data.target_names))
return results

def create_visualization_pipeline():
"""Create comprehensive visualization pipeline for different data types"""
from sklearn.datasets import (load_breast_cancer, load_wine, load_digits,
fetch_olivetti_faces)
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Load diverse datasets
datasets = {
'Breast Cancer': load_breast_cancer(),
'Wine': load_wine(),
'Digits': load_digits(),
'Faces': fetch_olivetti_faces()
}
# Create visualization pipeline
def visualize_dataset(name, dataset, max_samples=1000):
"""Visualize single dataset with multiple methods"""
X, y = dataset.data, dataset.target
# Sample data if too large
if len(X) > max_samples:
indices = np.random.choice(len(X), max_samples, replace=False)
X, y = X[indices], y[indices]
print(f"\nProcessing {name} dataset:")
print(f" Shape: {X.shape}")
print(f" Classes: {len(np.unique(y))}")
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply different reduction methods
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
# LDA (limit components to n_classes - 1)
n_components_lda = min(len(np.unique(y)) - 1, 2)
if n_components_lda > 0:
lda = LDA(n_components=n_components_lda)
X_lda = lda.fit_transform(X_scaled, y)
else:
X_lda = None
# t-SNE (with PCA preprocessing if high-dimensional)
if X_scaled.shape[1] > 50:
pca_pre = PCA(n_components=50, random_state=42)
X_for_tsne = pca_pre.fit_transform(X_scaled)
else:
X_for_tsne = X_scaled
tsne = TSNE(n_components=2, perplexity=min(30, len(X)//4),
random_state=42, verbose=0)
X_tsne = tsne.fit_transform(X_for_tsne)
# Create visualization
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
# Original data (first 2 features)
if X.shape[1] >= 2:
scatter = axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[0].set_title('Original (First 2 Features)')
feature_names = getattr(dataset, 'feature_names', ['Feature 0', 'Feature 1'])
axes[0].set_xlabel(feature_names[0][:20])
axes[0].set_ylabel(feature_names[1][:20])
else:
axes[0].text(0.5, 0.5, 'Not applicable', ha='center', va='center',
transform=axes[0].transAxes)
axes[0].set_title('Original Data')
# PCA
scatter = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[1].set_title(f'PCA ({pca.explained_variance_ratio_.sum():.1%} var)')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
# LDA
if X_lda is not None:
if X_lda.shape[1] >= 2:
scatter = axes[2].scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[2].set_xlabel('LD1')
axes[2].set_ylabel('LD2')
else:
# 1D LDA
scatter = axes[2].scatter(X_lda[:, 0], np.random.normal(0, 0.1, len(X_lda)),
c=y, cmap='tab10', alpha=0.7)
axes[2].set_xlabel('LD1')
axes[2].set_ylabel('Random Jitter')
axes[2].set_title('LDA')
else:
axes[2].text(0.5, 0.5, 'Single class', ha='center', va='center',
transform=axes[2].transAxes)
axes[2].set_title('LDA (Not applicable)')
# t-SNE
scatter = axes[3].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[3].set_title('t-SNE')
axes[3].set_xlabel('t-SNE 1')
axes[3].set_ylabel('t-SNE 2')
# Add colorbar to last plot
plt.colorbar(scatter, ax=axes[3])
for ax in axes:
ax.grid(True, alpha=0.3)
plt.suptitle(f'{name} Dataset Visualization Comparison', fontsize=16)
plt.tight_layout()
plt.show()
return {
'pca_variance': pca.explained_variance_ratio_.sum(),
'n_components_lda': n_components_lda,
'tsne_kl': tsne.kl_divergence_
}
# Process all datasets
results = {}
for name, dataset in datasets.items():
results[name] = visualize_dataset(name, dataset)
return results
visualization_results = create_visualization_pipeline()

def interactive_dim_reduction_explorer(X, y, feature_names=None):
"""Interactive explorer for different dimensionality reduction techniques"""
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
print("Interactive Dimensionality Reduction Explorer:")
print("="*45)
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Methods configuration
methods_config = {
'PCA': {
'method': PCA,
'params': {'n_components': 2, 'random_state': 42},
'requires_y': False
},
'LDA': {
'method': LDA,
'params': {'n_components': min(len(np.unique(y))-1, 2)},
'requires_y': True
},
't-SNE (Perp=5)': {
'method': TSNE,
'params': {'n_components': 2, 'perplexity': 5, 'random_state': 42, 'verbose': 0},
'requires_y': False
},
't-SNE (Perp=30)': {
'method': TSNE,
'params': {'n_components': 2, 'perplexity': 30, 'random_state': 42, 'verbose': 0},
'requires_y': False
},
't-SNE (Perp=50)': {
'method': TSNE,
'params': {'n_components': 2, 'perplexity': 50, 'random_state': 42, 'verbose': 0},
'requires_y': False
}
}
# Apply all methods
results = {}
valid_methods = {}
for name, config in methods_config.items():
try:
print(f"Applying {name}...")
method_class = config['method']
params = config['params']
# Skip if invalid configuration
if name.startswith('LDA') and params['n_components'] <= 0:
print(f" Skipped (insufficient classes)")
continue
if name.startswith('t-SNE') and params['perplexity'] >= len(X) / 3:
print(f" Skipped (perplexity too large)")
continue
method = method_class(**params)
if config['requires_y']:
X_transformed = method.fit_transform(X_scaled, y)
else:
X_transformed = method.fit_transform(X_scaled)
results[name] = X_transformed
valid_methods[name] = method
print(f" ✅ Success")
except Exception as e:
print(f" ✅ Failed: {e}")
continue
# Create comprehensive visualization
n_methods = len(results)
cols = min(3, n_methods)
rows = (n_methods + cols - 1) // cols
fig, axes = plt.subplots(rows, cols, figsize=(5*cols, 5*rows))
# Normalize axes to a flat array (subplots returns a bare Axes when rows=cols=1)
axes_flat = np.atleast_1d(axes).flatten()
for i, (name, X_transformed) in enumerate(results.items()):
ax = axes_flat[i]
# Create scatter plot
scatter = ax.scatter(X_transformed[:, 0], X_transformed[:, 1],
c=y, cmap='tab10', alpha=0.7, s=30)
ax.set_title(name)
ax.grid(True, alpha=0.3)
# Set axis labels based on method
if name.startswith('PCA'):
method = valid_methods[name]
ax.set_xlabel(f'PC1 ({method.explained_variance_ratio_[0]:.1%})')
ax.set_ylabel(f'PC2 ({method.explained_variance_ratio_[1]:.1%})')
elif name.startswith('LDA'):
ax.set_xlabel('LD1')
ax.set_ylabel('LD2')
else: # t-SNE
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
# Hide unused subplots
for i in range(n_methods, len(axes_flat)):
axes_flat[i].axis('off')
# Add colorbar
if n_methods > 0:
plt.colorbar(scatter, ax=axes_flat[n_methods-1])
plt.tight_layout()
plt.show()
# Performance analysis
print(f"\nMethod Analysis:")
print("-" * 40)
for name, method in valid_methods.items():
if hasattr(method, 'explained_variance_ratio_'):
variance_explained = method.explained_variance_ratio_.sum()
print(f"{name:20s}: {variance_explained:.1%} variance explained")
elif hasattr(method, 'kl_divergence_'):
print(f"{name:20s}: KL divergence = {method.kl_divergence_:.2f}")
else:
print(f"{name:20s}: Method applied successfully")
return results, valid_methods

def ml_pipeline_with_dim_reduction():
"""Complete ML pipeline integrating dimensionality reduction"""
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Generate high-dimensional dataset
X, y = make_classification(n_samples=1000, n_features=100,
n_informative=20, n_redundant=30,
n_classes=3, random_state=42)
print("ML Pipeline with Dimensionality Reduction:")
print("="*45)
print(f"Dataset shape: {X.shape}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
# Define different pipeline configurations
pipelines = {
'Baseline (No Reduction)': Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'PCA + Random Forest': Pipeline([
('scaler', StandardScaler()),
('pca', PCA(random_state=42)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'LDA + SVM': Pipeline([
('scaler', StandardScaler()),
('lda', LDA()),
('classifier', SVC(random_state=42))
]),
'PCA + Logistic Regression': Pipeline([
('scaler', StandardScaler()),
('pca', PCA(random_state=42)),
('classifier', LogisticRegression(random_state=42, max_iter=1000))
])
}
# Parameter grids for hyperparameter tuning
param_grids = {
'Baseline (No Reduction)': {
'classifier__n_estimators': [50, 100],
'classifier__max_depth': [10, 20, None]
},
'PCA + Random Forest': {
'pca__n_components': [10, 20, 50],
'classifier__n_estimators': [50, 100],
'classifier__max_depth': [10, 20]
},
'LDA + SVM': {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['linear', 'rbf']
},
'PCA + Logistic Regression': {
'pca__n_components': [10, 20, 50],
'classifier__C': [0.1, 1, 10]
}
}
# Train and evaluate pipelines
results = {}
for name, pipeline in pipelines.items():
print(f"\nTraining {name}...")
# Grid search with cross-validation
param_grid = param_grids[name]
grid_search = GridSearchCV(pipeline, param_grid, cv=3,
scoring='accuracy', n_jobs=-1, verbose=0)
# Fit and predict
grid_search.fit(X_train, y_train)
y_pred = grid_search.predict(X_test)
# Store results
results[name] = {
'best_params': grid_search.best_params_,
'best_score': grid_search.best_score_,
'test_score': grid_search.score(X_test, y_test),
'predictions': y_pred,
'model': grid_search.best_estimator_
}
print(f" Best CV score: {grid_search.best_score_:.4f}")
print(f" Test score: {grid_search.score(X_test, y_test):.4f}")
print(f" Best params: {grid_search.best_params_}")
# Comprehensive comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Performance comparison
pipeline_names = list(results.keys())
cv_scores = [results[name]['best_score'] for name in pipeline_names]
test_scores = [results[name]['test_score'] for name in pipeline_names]
x = np.arange(len(pipeline_names))
width = 0.35
bars1 = axes[0, 0].bar(x - width/2, cv_scores, width, label='CV Score', alpha=0.8)
bars2 = axes[0, 0].bar(x + width/2, test_scores, width, label='Test Score', alpha=0.8)
axes[0, 0].set_xlabel('Pipeline')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Performance Comparison')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels([name.split()[0] for name in pipeline_names], rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Add value labels
for bars in [bars1, bars2]:
for bar in bars:
height = bar.get_height()
axes[0, 0].text(bar.get_x() + bar.get_width()/2., height + 0.01,
f'{height:.3f}', ha='center', va='bottom', fontsize=8)
# Feature importance analysis (for PCA pipelines)
pca_pipeline = results['PCA + Random Forest']['model']
if hasattr(pca_pipeline.named_steps['pca'], 'components_'):
# Get PCA components and feature importance
pca_components = pca_pipeline.named_steps['pca'].components_
rf_importance = pca_pipeline.named_steps['classifier'].feature_importances_
# Calculate original feature importance through PCA
original_importance = np.abs(pca_components.T @ rf_importance)
# Show top 15 features
top_features = np.argsort(original_importance)[-15:]
axes[0, 1].barh(range(15), original_importance[top_features])
axes[0, 1].set_yticks(range(15))
axes[0, 1].set_yticklabels([f'Feature {i}' for i in top_features])
axes[0, 1].set_xlabel('Importance Score')
axes[0, 1].set_title('PCA Feature Importance')
axes[0, 1].grid(True, alpha=0.3)
# CV vs test score comparison (generalization check)
axes[1, 0].scatter(cv_scores, test_scores, s=100, alpha=0.7)
for i, name in enumerate(pipeline_names):
axes[1, 0].annotate(name.split()[0], (cv_scores[i], test_scores[i]),
xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 0].set_xlabel('Cross-Validation Accuracy')
axes[1, 0].set_ylabel('Test Accuracy')
axes[1, 0].set_title('CV vs Test Performance')
axes[1, 0].grid(True, alpha=0.3)
# Pipeline recommendations
axes[1, 1].axis('off')
# Find best pipeline by test accuracy
best_pipeline = max(results.keys(), key=lambda k: results[k]['test_score'])
recommendations = f"""
RECOMMENDATIONS:
Best Pipeline: {best_pipeline}
✅ CV Score: {results[best_pipeline]['best_score']:.4f}
✅ Test Score: {results[best_pipeline]['test_score']:.4f}
INSIGHTS:
✅ Reduction pipelines can match the full-feature baseline
✅ n_components is tuned like any other hyperparameter
✅ LDA is capped at n_classes - 1 components
"""
axes[1, 1].text(0.05, 0.95, recommendations, transform=axes[1, 1].transAxes,
fontsize=10, verticalalignment='top', fontfamily='monospace',
bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
plt.tight_layout()
plt.show()
# Detailed classification reports
print(f"\n{'='*60}")
print("DETAILED CLASSIFICATION REPORTS:")
print(f"{'='*60}")
for method_name, result in results.items():
print(f"\n{method_name}:")
print("-" * len(method_name))
print(classification_report(y_test, result['predictions']))
return results

def analyze_computational_efficiency():
"""Analyze computational efficiency of different dimensionality reduction methods"""
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, IncrementalPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, f_classif
print("Computational Efficiency Analysis:")
print("="*35)
# Test different dataset sizes
dataset_sizes = [
(500, 50),
(1000, 100),
(2000, 200),
(5000, 500)
]
methods = {
'PCA': lambda X, y: PCA(n_components=min(20, X.shape[1]//2)).fit_transform(X),
'Incremental PCA': lambda X, y: IncrementalPCA(n_components=min(20, X.shape[1]//2),
batch_size=100).fit_transform(X),
'Truncated SVD': lambda X, y: TruncatedSVD(n_components=min(20, X.shape[1]//2)).fit_transform(X),
'LDA': lambda X, y: LDA(n_components=min(len(np.unique(y))-1, 10)).fit_transform(X, y),
'SelectKBest': lambda X, y: SelectKBest(f_classif, k=min(20, X.shape[1]//2)).fit_transform(X, y),
't-SNE (small)': lambda X, y: TSNE(n_components=2, perplexity=min(30, len(X)//4),
verbose=0).fit_transform(X) if len(X) <= 1000 else None
}
results = []
for n_samples, n_features in dataset_sizes:
print(f"\nDataset size: {n_samples} × {n_features}")
# Generate data
X, y = make_classification(n_samples=n_samples, n_features=n_features,
n_informative=min(20, n_features//2),
n_classes=5, random_state=42)
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
for method_name, method_func in methods.items():
try:
print(f" Testing {method_name}...", end=' ')
# Measure time
start_time = time.time()
result = method_func(X_scaled, y)
end_time = time.time()
if result is not None:
execution_time = end_time - start_time
memory_usage = result.nbytes / (1024**2) # MB
results.append({
'method': method_name,
'n_samples': n_samples,
'n_features': n_features,
'time': execution_time,
'memory_mb': memory_usage,
'output_shape': result.shape
})
print(f"{execution_time:.3f}s, {memory_usage:.2f}MB")
else:
print("Skipped (too large)")
except Exception as e:
print(f"Error: {e}")
continue
# Analyze results
df_results = pd.DataFrame(results)
if not df_results.empty:
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Execution time vs dataset size
for method in df_results['method'].unique():
method_data = df_results[df_results['method'] == method]
dataset_complexity = method_data['n_samples'] * method_data['n_features']
axes[0, 0].loglog(dataset_complexity, method_data['time'],
'o-', label=method, alpha=0.7)
axes[0, 0].set_xlabel('Dataset Complexity (samples × features)')
axes[0, 0].set_ylabel('Execution Time (seconds)')
axes[0, 0].set_title('Scalability Analysis')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Memory usage comparison
for method in df_results['method'].unique():
method_data = df_results[df_results['method'] == method]
axes[0, 1].plot(method_data['n_samples'], method_data['memory_mb'],
'o-', label=method, alpha=0.7)
axes[0, 1].set_xlabel('Number of Samples')
axes[0, 1].set_ylabel('Memory Usage (MB)')
axes[0, 1].set_title('Memory Consumption')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Method comparison for largest dataset
largest_dataset = df_results[df_results['n_samples'] == max(df_results['n_samples'])]
if not largest_dataset.empty:
methods_subset = largest_dataset['method'].tolist()
times_subset = largest_dataset['time'].tolist()
bars = axes[1, 0].bar(methods_subset, times_subset, alpha=0.7)
axes[1, 0].set_ylabel('Execution Time (seconds)')
axes[1, 0].set_title(f'Performance on Largest Dataset ({max(df_results["n_samples"])}×{max(df_results["n_features"])})')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].grid(True, alpha=0.3)
# Add value labels
for bar, time_val in zip(bars, times_subset):
axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{time_val:.3f}s', ha='center', va='bottom', fontsize=8)
# Efficiency ratio (output features / input features) vs time
df_results['efficiency_ratio'] = df_results.apply(
lambda row: row['output_shape'][1] / row['n_features'], axis=1
)
scatter = axes[1, 1].scatter(df_results['efficiency_ratio'], df_results['time'],
c=df_results['n_samples'], cmap='viridis',
s=60, alpha=0.7)
axes[1, 1].set_xlabel('Compression Ratio (output/input features)')
axes[1, 1].set_ylabel('Execution Time (seconds)')
axes[1, 1].set_title('Compression vs Speed Trade-off')
plt.colorbar(scatter, ax=axes[1, 1], label='Number of Samples')
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Performance recommendations
print(f"\n{'='*50}")
print("PERFORMANCE RECOMMENDATIONS:")
print(f"{'='*50}")
# Find fastest method for each dataset size
for size in df_results[['n_samples', 'n_features']].drop_duplicates().values:
n_samples, n_features = size
size_data = df_results[(df_results['n_samples'] == n_samples) &
(df_results['n_features'] == n_features)]
if not size_data.empty:
fastest_method = size_data.loc[size_data['time'].idxmin(), 'method']
fastest_time = size_data['time'].min()
print(f"For {n_samples}×{n_features}: {fastest_method} ({fastest_time:.3f}s)")
# General recommendations
print(f"\nGeneral Guidelines:")
avg_times = df_results.groupby('method')['time'].mean().sort_values()
print(f"✅ Fastest overall: {avg_times.index[0]}")
print(f"✅ Most scalable: Incremental PCA (for very large datasets)")
print(f"✅ Best for visualization: t-SNE (but expensive)")
print(f"✅ Good balance: PCA or Truncated SVD")
return df_results

def image_compression_lab():
"""Hands-on lab: Image compression using PCA"""
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
print("Lab 1: Image Compression with PCA")
print("="*35)
# Load face dataset
faces = fetch_olivetti_faces()
X_faces = faces.data # 400 faces, each 64×64 pixels (4096 features)
print(f"Dataset shape: {X_faces.shape}")
print(f"Image resolution: 64×64 pixels")
# Select a single face for compression analysis
face_idx = 0
original_face = X_faces[face_idx].reshape(64, 64)
# Test different numbers of components
n_components_list = [10, 25, 50, 100, 200, 400]  # max components = min(n_samples, n_features) = 400
# Apply PCA to entire dataset
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_faces)
# Analyze compression results
compression_results = []
fig, axes = plt.subplots(2, len(n_components_list), figsize=(20, 8))
for i, n_comp in enumerate(n_components_list):
# Reconstruct using n components
pca = PCA(n_components=n_comp, random_state=42)
X_pca = pca.fit_transform(X_faces)
X_reconstructed = pca.inverse_transform(X_pca)
# Reconstruct the selected face
reconstructed_face = X_reconstructed[face_idx].reshape(64, 64)
# Calculate metrics
mse = np.mean((original_face - reconstructed_face) ** 2)
variance_explained = np.sum(pca.explained_variance_ratio_)
compression_ratio = 4096 / (n_comp + n_comp * 4096 / 400) # pixels per image vs. (codes per image + per-image share of stored components)
compression_results.append({
'n_components': n_comp,
'mse': mse,
'variance_explained': variance_explained,
'compression_ratio': compression_ratio
})
# Display original and reconstructed
axes[0, i].imshow(original_face, cmap='gray')
axes[0, i].set_title(f'Original')
axes[0, i].axis('off')
axes[1, i].imshow(reconstructed_face, cmap='gray')
axes[1, i].set_title(f'{n_comp} comp\nMSE: {mse:.4f}\nVar: {variance_explained:.1%}')
axes[1, i].axis('off')
print(f"{n_comp:3d} components: MSE={mse:.4f}, Variance={variance_explained:.1%}, Compression={compression_ratio:.1f}x")
plt.tight_layout()
plt.show()
# Analysis plots
df_compression = pd.DataFrame(compression_results)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# MSE vs Components
axes[0].plot(df_compression['n_components'], df_compression['mse'], 'ro-')
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('Mean Squared Error')
axes[0].set_title('Reconstruction Error vs Components')
axes[0].grid(True, alpha=0.3)
# Variance Explained
axes[1].plot(df_compression['n_components'], df_compression['variance_explained'], 'bo-')
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Variance Explained')
axes[1].set_title('Information Retention vs Components')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Compression Trade-off
axes[2].scatter(df_compression['compression_ratio'], df_compression['mse'],
c=df_compression['n_components'], cmap='viridis', s=100)
axes[2].set_xlabel('Compression Ratio')
axes[2].set_ylabel('Reconstruction Error (MSE)')
axes[2].set_title('Compression vs Quality Trade-off')
# Add component labels
for _, row in df_compression.iterrows():
axes[2].annotate(f"{row['n_components']}",
(row['compression_ratio'], row['mse']),
xytext=(5, 5), textcoords='offset points', fontsize=8)
plt.colorbar(axes[2].collections[0], ax=axes[2], label='Components')
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return df_compression
compression_lab_results = image_compression_lab()

def dimensionality_reduction_pipeline_lab():
"""Lab 2: Building a complete dimensionality reduction pipeline"""
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
print("Lab 2: Complete Dimensionality Reduction Pipeline")
print("="*50)
# Load real dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
print(f"Dataset: {data.DESCR.split(':', 1)[0]}")
print(f"Shape: {X.shape}")
print(f"Classes: {len(np.unique(y))} ({np.bincount(y)})")
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
stratify=y, random_state=42)
# Step 1: Exploratory Data Analysis
print(f"\nStep 1: Exploratory Data Analysis")
print("-" * 35)
# Analyze feature correlations
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
correlation_matrix = np.corrcoef(X_scaled.T)
high_corr_pairs = []
for i in range(len(correlation_matrix)):
for j in range(i+1, len(correlation_matrix)):
if abs(correlation_matrix[i, j]) > 0.8:
high_corr_pairs.append((i, j, correlation_matrix[i, j]))
print(f"Highly correlated feature pairs (>0.8): {len(high_corr_pairs)}")
# Step 2: Compare Dimensionality Reduction Methods
print(f"\nStep 2: Method Comparison")
print("-" * 25)
methods = {
'No Reduction': None,
'PCA (95% var)': 'pca_95',
'PCA (10 comp)': 'pca_10',
'LDA': 'lda',
'SelectKBest (10)': 'select_10'
}
results = {}
for method_name, method_type in methods.items():
print(f"\nTesting {method_name}...")
if method_type is None:
# No reduction
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
X_train_processed = X_train
X_test_processed = X_test
elif method_type == 'pca_95':
# PCA with 95% variance
pca = PCA(random_state=42)
X_scaled_train = StandardScaler().fit_transform(X_train)
pca.fit(X_scaled_train)
# Find components for 95% variance
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_comp_95 = np.argmax(cumvar >= 0.95) + 1
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=n_comp_95, random_state=42)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'pca_10':
# PCA with 10 components
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10, random_state=42)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'lda':
# LDA
pipeline = Pipeline([
('scaler', StandardScaler()),
('lda', LDA()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
elif method_type == 'select_10':
# Feature selection
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(f_classif, k=10)),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Get effective dimensionality
if method_type is None:
n_features_used = X.shape[1]
elif 'pca' in pipeline.named_steps:
n_features_used = pipeline.named_steps['pca'].n_components_
elif 'lda' in pipeline.named_steps:
n_features_used = pipeline.named_steps['lda'].scalings_.shape[1]
elif 'selector' in pipeline.named_steps:
n_features_used = pipeline.named_steps['selector'].k
else:
n_features_used = X.shape[1]
results[method_name] = {
'accuracy': accuracy,
'n_features': n_features_used,
'pipeline': pipeline,
'predictions': y_pred
}
print(f" Accuracy: {accuracy:.4f}")
print(f" Features used: {n_features_used}")
print(f" Reduction: {(1 - n_features_used/X.shape[1])*100:.1f}%")
# Step 3: Detailed Analysis
print(f"\nStep 3: Detailed Analysis")
print("-" * 23)
# Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Performance comparison
method_names = list(results.keys())
accuracies = [results[m]['accuracy'] for m in method_names]
n_features = [results[m]['n_features'] for m in method_names]
bars = axes[0, 0].bar(range(len(method_names)), accuracies, alpha=0.7)
axes[0, 0].set_xticks(range(len(method_names)))
axes[0, 0].set_xticklabels([name.split()[0] for name in method_names], rotation=45)
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Method Performance Comparison')
axes[0, 0].grid(True, alpha=0.3)
# Add value labels
for bar, acc in zip(bars, accuracies):
axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{acc:.3f}', ha='center', va='bottom', fontsize=9)
# Feature count comparison
bars = axes[0, 1].bar(range(len(method_names)), n_features, alpha=0.7)
axes[0, 1].set_xticks(range(len(method_names)))
axes[0, 1].set_xticklabels([name.split()[0] for name in method_names], rotation=45)
axes[0, 1].set_ylabel('Number of Features')
axes[0, 1].set_title('Dimensionality Comparison')
axes[0, 1].grid(True, alpha=0.3)
# PCA analysis (if available)
pca_pipeline = results['PCA (95% var)']['pipeline']
if 'pca' in pca_pipeline.named_steps:
pca = pca_pipeline.named_steps['pca']
# Explained variance
axes[0, 2].plot(range(1, len(pca.explained_variance_ratio_) + 1),
np.cumsum(pca.explained_variance_ratio_), 'bo-')
axes[0, 2].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[0, 2].set_xlabel('Component')
axes[0, 2].set_ylabel('Cumulative Variance Explained')
axes[0, 2].set_title('PCA Variance Analysis')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)
# Feature importance in first 2 PCs
pc1_importance = np.abs(pca.components_[0])
pc2_importance = np.abs(pca.components_[1])
top_features_pc1 = np.argsort(pc1_importance)[-10:]
top_features_pc2 = np.argsort(pc2_importance)[-10:]
axes[1, 0].barh(range(10), pc1_importance[top_features_pc1])
axes[1, 0].set_yticks(range(10))
axes[1, 0].set_yticklabels([feature_names[i][:15] for i in top_features_pc1], fontsize=8)
axes[1, 0].set_xlabel('Importance Score')
axes[1, 0].set_title('PC1 - Top Feature Contributions')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 1].barh(range(10), pc2_importance[top_features_pc2])
axes[1, 1].set_yticks(range(10))
axes[1, 1].set_yticklabels([feature_names[i][:15] for i in top_features_pc2], fontsize=8)
axes[1, 1].set_xlabel('Importance Score')
axes[1, 1].set_title('PC2 - Top Feature Contributions')
axes[1, 1].grid(True, alpha=0.3)
# Accuracy vs Dimensionality trade-off
axes[1, 2].scatter(n_features, accuracies, s=100, alpha=0.7)
for i, method in enumerate(method_names):
axes[1, 2].annotate(method.split()[0], (n_features[i], accuracies[i]),
xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 2].set_xlabel('Number of Features')
axes[1, 2].set_ylabel('Accuracy')
axes[1, 2].set_title('Accuracy vs Dimensionality Trade-off')
axes[1, 2].grid(True, alpha=0.3)
# Method recommendations (all six panels are in use, so print to console
# instead of overwriting the trade-off plot on axes[1, 2])
best_method = max(results.keys(), key=lambda k: results[k]['accuracy'])
most_efficient = min(results.keys(), key=lambda k: results[k]['n_features']
if results[k]['accuracy'] > 0.9 else float('inf'))
recommendations = f"""
RECOMMENDATIONS:
Best Accuracy: {best_method}
✅ Accuracy: {results[best_method]['accuracy']:.4f}
✅ Features: {results[best_method]['n_features']}
Most Efficient: {most_efficient}
✅ Accuracy: {results[most_efficient]['accuracy']:.4f}
✅ Features: {results[most_efficient]['n_features']}
✅ Reduction: {(1-results[most_efficient]['n_features']/X.shape[1])*100:.1f}%
INSIGHTS:
✅ {len(high_corr_pairs)} highly correlated feature pairs
✅ PCA effective for noise reduction
✅ LDA good for classification tasks
✅ Feature selection preserves interpretability
"""
print(recommendations)
plt.tight_layout()
plt.show()
# Detailed classification reports
print(f"\n{'='*60}")
print("DETAILED CLASSIFICATION REPORTS:")
print(f"{'='*60}")
for method_name, result in results.items():
print(f"\n{method_name}:")
print("-" * len(method_name))
print(classification_report(y_test, result['predictions'], target_names=data.target_names))
return results

# Exercise 7.1: Implement PCA with different initialization methods
def exercise_pca_implementation():
"""
Exercise: Implement PCA with SVD and compare with eigendecomposition
Tasks:
1. Implement PCA using SVD (Singular Value Decomposition)
2. Compare with eigendecomposition method
3. Analyze numerical stability and performance
"""
# TODO: Implement SVD-based PCA
class PCA_SVD:
def __init__(self, n_components=None):
self.n_components = n_components
# TODO: Initialize other necessary attributes
def fit(self, X):
# TODO: Implement PCA using SVD, then return self
# (fit_transform below chains self.fit(X).transform(X))
# Hint: center X, then use np.linalg.svd(X_centered, full_matrices=False)
pass
def transform(self, X):
# TODO: Transform data using learned components
pass
def fit_transform(self, X):
return self.fit(X).transform(X)
# TODO: Compare with eigendecomposition method
# TODO: Test numerical stability with different datasets
# TODO: Performance comparison
print("Exercise 7.1: Implement and test your PCA_SVD class")
# Exercise 7.2: t-SNE Parameter Exploration
def exercise_tsne_exploration():
"""
Exercise: Comprehensive t-SNE parameter exploration
Tasks:
1. Implement parameter grid search for t-SNE
2. Create evaluation metrics for visualization quality
3. Build interactive parameter selector
"""
# TODO: Load different datasets (digits, faces, synthetic)
# TODO: Implement grid search over perplexity and learning rate
# TODO: Define visualization quality metrics
# TODO: Create recommendation system for parameters
print("Exercise 7.2: Explore t-SNE parameters systematically")
# Exercise 7.3: Dimensionality Reduction for Streaming Data
def exercise_streaming_pca():
"""
Exercise: Implement streaming/online PCA
Tasks:
1. Implement incremental PCA from scratch
2. Compare with batch PCA on streaming data
3. Analyze memory usage and accuracy trade-offs
"""
# TODO: Implement online PCA algorithm
# TODO: Simulate streaming data scenario
# TODO: Compare accuracy with batch processing
# TODO: Analyze computational and memory efficiency
print("Exercise 7.3: Implement streaming PCA algorithm")✅ Curse of Dimensionality Understanding - Identified high-dimensional data problems and their impact - Analyzed distance concentration and sparsity issues - Evaluated computational complexity considerations - Developed diagnostic tools for high-dimensional datasets
✅ Principal Component Analysis (PCA) Mastery
- Understood mathematical foundations (eigendecomposition, covariance)
- Implemented PCA from scratch with complete derivation
- Mastered component selection techniques (elbow method, cross-validation)
- Applied PCA for noise reduction, compression, and visualization

✅ Advanced Dimensionality Reduction Techniques
- Implemented Linear Discriminant Analysis (LDA) for supervised reduction
- Mastered t-SNE for non-linear visualization with parameter optimization
- Compared feature selection vs feature extraction approaches
- Developed intelligent method selection frameworks

✅ Practical Applications and Integration
- Built complete ML pipelines with dimensionality reduction preprocessing
- Analyzed computational efficiency and scalability considerations
- Created comprehensive visualization and exploration tools
- Implemented real-world case studies (image compression, data visualization)

💻 Implementation Expertise
- From-scratch algorithm implementations with mathematical rigor
- Efficient computation techniques for large datasets
- Parameter optimization and hyperparameter tuning
- Pipeline integration with the scikit-learn ecosystem

📊 Analysis and Evaluation
- Comprehensive evaluation metrics and quality assessment
- Trade-off analysis (compression vs accuracy, speed vs quality)
- Method selection based on data characteristics and requirements
- Performance benchmarking and scalability analysis

🔧 Practical Tools
- Interactive exploration and visualization frameworks
- Automated method recommendation systems
- Computational efficiency analysis tools
- Real-world application pipelines

🖼️ Computer Vision
- Image compression and noise reduction
- Feature extraction for image classification
- Facial recognition preprocessing

📈 Data Science and Analytics
- Exploratory data analysis and visualization
- High-dimensional data preprocessing
- Pattern recognition and cluster analysis

🔬 Scientific Computing
- Genomics and bioinformatics applications
- Signal processing and noise reduction
- Experimental data analysis
When to Use Each Method:
| Method | Best For | Avoid When |
|---|---|---|
| PCA | General preprocessing, noise reduction, visualization | Need interpretable features |
| LDA | Classification tasks, supervised reduction | Single class or unlabeled data |
| t-SNE | Exploratory visualization, cluster discovery | Preprocessing for ML, large datasets |
| Feature Selection | Model interpretability, regulatory compliance | Signal is spread across many weakly informative, interacting features |
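To make the table actionable, its rules of thumb can be captured in a few lines of code. The selector below is purely illustrative; its thresholds (for example, the 5,000-sample cutoff for t-SNE) are assumptions, not hard limits.

def recommend_method(n_samples, n_features, n_classes=None,
                     labeled=False, goal='preprocessing'):
    """Illustrative rule-of-thumb selector distilled from the table above."""
    if goal == 'visualization':
        # t-SNE excels at exploratory plots but scales poorly
        return 't-SNE' if n_samples <= 5000 else 'PCA (then t-SNE on top)'
    if goal == 'interpretability':
        return 'Feature Selection'  # keeps original, explainable features
    if labeled and n_classes is not None and n_classes > 1:
        return 'LDA'  # supervised reduction to at most n_classes - 1 dims
    return 'PCA'  # general-purpose default for preprocessing

print(recommend_method(1000, 50, n_classes=3, labeled=True))  # -> LDA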
Implementation Recommendations:
❌ Common Mistakes:
- Applying PCA to non-scaled data
- Using t-SNE for preprocessing instead of visualization
- Selecting components based on arbitrary thresholds
- Ignoring computational complexity for large datasets
✅ Best Practices (see the sketch below):
- Scale features appropriately for each method
- Use PCA for preprocessing, t-SNE for visualization
- Select components based on downstream task performance
- Consider incremental/streaming methods for scalability
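A minimal sketch of these practices, assuming scikit-learn: scaling and PCA live inside a Pipeline so they are fit on training data only, while t-SNE stays outside the model and serves visualization alone.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE

# Right: scale -> PCA -> model, fit on training data only
model = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep components for 95% of variance
    ('clf', LogisticRegression(max_iter=1000))
])
# model.fit(X_train, y_train); model.score(X_test, y_test)

# Right: t-SNE only for a one-off visualization, never inside the pipeline
# X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)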
The dimensionality reduction techniques covered integrate seamlessly with:
- Classification and Regression: preprocessing to improve performance and reduce overfitting
- Clustering: visualization and noise reduction for better cluster discovery
- Feature Engineering: automated feature creation and selection
- Model Deployment: efficient models with reduced computational requirements
This chapter provides foundation for: - Deep Learning: Understanding of autoencoders and representation learning - Manifold Learning: Advanced non-linear dimensionality reduction - Big Data: Distributed and streaming dimensionality reduction - Domain-Specific Applications: Specialized techniques for images, text, and signals
End of Chapter 7: Dimensionality Reduction
Next: Chapter 8 will focus on End-to-End Machine Learning Projects, integrating all learned techniques into complete, real-world applications with proper methodology and deployment considerations.
By the end of this chapter, you will have evolved from a student of algorithms to a conductor of intelligent systems, able to:
- Orchestrate complete ML symphonies using the CRISP-DM methodology as your musical score
- Design resilient pipelines that gracefully handle the chaos of real-world data
- Transform business whispers into algorithmic solutions that sing with clarity
- Navigate the complex dance between statistical rigor and business pragmatism
- Deploy ML solutions that don’t just work in notebooks, but thrive in production storms
- Craft documentation and processes that tell the story of your analytical journey
“In theory, there is no difference between theory and practice. In practice, there is.” - Yogi Berra

Welcome to the most exhilarating chapter of our journey, where the elegant mathematical theories we’ve mastered meet the wild, unpredictable, and utterly fascinating world of real problems. This is where data scientists are truly born, not in the comfort of clean datasets and perfect algorithms, but in the trenches of missing values, shifting business requirements, and the eternal question: “Will it work on Monday morning?”
Imagine you’re a master craftsperson in an ancient guild, but instead of forging steel or weaving tapestries, you’re creating intelligent systems that solve real human problems. Each project is a masterpiece waiting to be discovered, hidden within the raw materials of data, business needs, and computational constraints.
This chapter is your apprenticeship in ML craftsmanship, where we don’t just build models, we architect solutions that endure, evolve, and enchant. We’ll embark on four extraordinary quests:
🏠 The Housing Oracle: Predicting real estate prices with the wisdom of statistical learning
📊 The Stock Market Whisperer: Dancing with financial time series and market psychology
💼 The Employee Loyalty Detective: Solving the mystery of workforce retention
🛍️ The Customer Journey Archaeologist: Uncovering the hidden tribes within your customer base
This isn’t just about connecting code blocks; it’s about developing the systems thinking that separates great data scientists from mere algorithm implementers. You’ll learn to see the invisible threads that connect business strategy to mathematical elegance, to anticipate failure modes before they occur, and to build solutions that grow more intelligent over time.

“A goal without a plan is just a wish. A plan without methodology is just hope.” - Modern Data Science Proverb

In the swirling cosmos of data science, where infinite possibilities exist and every path seems equally valid, CRISP-DM emerges as your philosophical compass: not just a methodology, but a way of thinking that has guided thousands of successful ML projects across industries and continents.

Think of CRISP-DM as the DNA of intelligent problem-solving. Just as DNA provides the blueprint for life, CRISP-DM provides the genetic code for transforming business questions into algorithmic answers. It’s not rigid scaffolding but adaptive wisdom: a living framework that breathes with your project’s unique rhythms.
1. Business Understanding - The Art of Asking the Right Questions
   Where we transform vague hunches into precise, measurable objectives
2. Data Understanding - The Detective Phase
   Where we become data archaeologists, uncovering stories hidden in numbers
3. Data Preparation - The Alchemical Transformation
   Where raw data becomes refined intelligence through careful craftsmanship
4. Modeling - The Creative Laboratory
   Where mathematical theories dance with computational reality
5. Evaluation - The Moment of Truth
   Where we separate genuine insights from statistical mirages
6. Deployment - The Birth of Intelligence
   Where algorithms become living systems that serve human needs
CRISP-DM isn’t just about following steps; it’s about developing intuition for the natural rhythms of discovery. Like a river that knows its path to the sea, great ML projects have an organic flow that CRISP-DM helps you recognize and honor.
The foundation of any successful ML project is a clear understanding of the business problem.
Key Activities:
- Define business objectives
- Assess the situation and resources
- Determine data mining goals
- Produce the project plan
Example Framework:
class ProjectDefinition:
def __init__(self):
self.business_objective = ""
self.success_criteria = []
self.constraints = {}
self.risks = []
self.timeline = {}
self.stakeholders = []
def define_problem(self, objective, metrics, constraints=None):
"""Define the core business problem and success metrics"""
self.business_objective = objective
self.success_criteria = metrics
self.constraints = constraints or {}
return {
'problem_type': self._classify_problem_type(),
'data_requirements': self._estimate_data_needs(),
'success_metrics': self.success_criteria
}
def _classify_problem_type(self):
"""Classify the ML problem type based on business objective"""
keywords = self.business_objective.lower()
if any(word in keywords for word in ['predict', 'forecast', 'estimate']):
if any(word in keywords for word in ['category', 'class', 'type']):
return 'classification'
else:
return 'regression'
elif any(word in keywords for word in ['group', 'segment', 'cluster']):
return 'clustering'
elif any(word in keywords for word in ['recommend', 'suggest']):
return 'recommendation'
else:
            return 'exploratory'

Understanding your data is crucial for project success.
Data Assessment Checklist:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
class DataProfiler:
def __init__(self, df):
self.df = df
self.profile = {}
def generate_profile(self):
"""Generate comprehensive data profile"""
self.profile = {
'basic_info': self._get_basic_info(),
'missing_data': self._analyze_missing_data(),
'data_types': self._analyze_data_types(),
'distributions': self._analyze_distributions(),
'correlations': self._analyze_correlations(),
'outliers': self._detect_outliers(),
'data_quality': self._assess_data_quality()
}
return self.profile
def _get_basic_info(self):
"""Get basic dataset information"""
return {
'shape': self.df.shape,
'memory_usage': self.df.memory_usage(deep=True).sum(),
'column_count': len(self.df.columns),
'row_count': len(self.df)
}
def _analyze_missing_data(self):
"""Analyze missing data patterns"""
missing_counts = self.df.isnull().sum()
missing_percentages = (missing_counts / len(self.df)) * 100
return {
'missing_counts': missing_counts[missing_counts > 0].to_dict(),
'missing_percentages': missing_percentages[missing_percentages > 0].to_dict(),
'missing_patterns': self._identify_missing_patterns()
}
def _identify_missing_patterns(self):
"""Identify patterns in missing data"""
# Create missing data indicator matrix
missing_matrix = self.df.isnull()
# Find common missing patterns
patterns = missing_matrix.value_counts().head(10)
return patterns.to_dict()
def _analyze_data_types(self):
"""Analyze data types and suggest improvements"""
type_analysis = {}
for column in self.df.columns:
col_type = str(self.df[column].dtype)
unique_count = self.df[column].nunique()
type_analysis[column] = {
'current_type': col_type,
'unique_values': unique_count,
'suggested_type': self._suggest_optimal_type(column),
'memory_optimization': self._suggest_memory_optimization(column)
}
return type_analysis
def _suggest_optimal_type(self, column):
"""Suggest optimal data type for a column"""
col = self.df[column]
if col.dtype == 'object':
# Check if it's actually numeric
try:
pd.to_numeric(col, errors='raise')
return 'numeric'
            except (ValueError, TypeError):
if col.nunique() < len(col) * 0.05: # Less than 5% unique
return 'category'
else:
return 'string'
elif col.dtype in ['int64', 'float64']:
# Check if we can downcast
if col.dtype == 'int64':
if col.min() >= 0:
if col.max() < 256:
return 'uint8'
elif col.max() < 65536:
return 'uint16'
else:
return 'uint32'
else:
if col.min() >= -128 and col.max() < 128:
return 'int8'
elif col.min() >= -32768 and col.max() < 32768:
return 'int16'
else:
return 'int32'
        return str(col.dtype)  # Keep current type

    # Minimal implementations of the helpers referenced in generate_profile,
    # added so the profiler runs end to end; extend them for real projects.
    def _suggest_memory_optimization(self, column):
        """Report the column's current memory footprint (rough heuristic)"""
        return {'current_bytes': int(self.df[column].memory_usage(deep=True))}

    def _analyze_distributions(self):
        """Summary statistics for numeric columns"""
        return self.df.describe(include=[np.number]).to_dict()

    def _analyze_correlations(self):
        """Pairwise correlations between numeric columns"""
        numeric = self.df.select_dtypes(include=[np.number])
        return numeric.corr().to_dict() if numeric.shape[1] > 1 else {}

    def _detect_outliers(self):
        """IQR-based outlier counts per numeric column"""
        counts = {}
        for col in self.df.select_dtypes(include=[np.number]).columns:
            q1, q3 = self.df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (self.df[col] < q1 - 1.5 * iqr) | (self.df[col] > q3 + 1.5 * iqr)
            counts[col] = int(mask.sum())
        return counts

    def _assess_data_quality(self):
        """Simple completeness score: share of non-null cells"""
        return {'completeness': float(self.df.notnull().mean().mean())}
def visualize_data_quality(self):
"""Create visualizations for data quality assessment"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Missing data heatmap
sns.heatmap(self.df.isnull(), cbar=True, ax=axes[0,0])
axes[0,0].set_title('Missing Data Pattern')
# Data type distribution
type_counts = self.df.dtypes.value_counts()
axes[0,1].pie(type_counts.values, labels=type_counts.index, autopct='%1.1f%%')
axes[0,1].set_title('Data Type Distribution')
# Correlation matrix
numeric_cols = self.df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 1:
corr_matrix = self.df[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[1,0])
axes[1,0].set_title('Correlation Matrix')
# Outlier detection summary
outlier_counts = {}
for col in numeric_cols:
Q1 = self.df[col].quantile(0.25)
Q3 = self.df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = ((self.df[col] < (Q1 - 1.5 * IQR)) |
(self.df[col] > (Q3 + 1.5 * IQR))).sum()
outlier_counts[col] = outliers
if outlier_counts:
axes[1,1].bar(outlier_counts.keys(), outlier_counts.values())
axes[1,1].set_title('Outlier Count by Column')
axes[1,1].tick_params(axis='x', rotation=45)
plt.tight_layout()
return fig
# Example usage
def demonstrate_data_profiling():
"""Demonstrate data profiling capabilities"""
# Generate sample dataset
np.random.seed(42)
n_samples = 1000
data = {
'age': np.random.randint(18, 80, n_samples),
'income': np.random.exponential(50000, n_samples),
'credit_score': np.random.normal(700, 100, n_samples),
'category': np.random.choice(['A', 'B', 'C'], n_samples),
'is_default': np.random.choice([0, 1], n_samples, p=[0.8, 0.2])
}
# Introduce missing values
df = pd.DataFrame(data)
missing_indices = np.random.choice(df.index, size=100, replace=False)
df.loc[missing_indices, 'income'] = np.nan
# Profile the data
profiler = DataProfiler(df)
profile = profiler.generate_profile()
print("Data Profile Summary:")
print(f"Shape: {profile['basic_info']['shape']}")
print(f"Memory Usage: {profile['basic_info']['memory_usage'] / 1024:.2f} KB")
print(f"Missing Data: {len(profile['missing_data']['missing_counts'])} columns affected")
    return df, profile

Data preparation typically consumes 60-80% of project time but is crucial for model success.
Comprehensive Data Pipeline:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
class DataPreprocessor(BaseEstimator, TransformerMixin):
def __init__(self,
numeric_strategy='median',
categorical_strategy='most_frequent',
encoding_strategy='onehot',
scaling_strategy='standard',
outlier_treatment='iqr',
feature_selection=None):
self.numeric_strategy = numeric_strategy
self.categorical_strategy = categorical_strategy
self.encoding_strategy = encoding_strategy
self.scaling_strategy = scaling_strategy
self.outlier_treatment = outlier_treatment
self.feature_selection = feature_selection
self.numeric_pipeline = None
self.categorical_pipeline = None
self.preprocessor = None
def fit(self, X, y=None):
"""Fit preprocessing pipeline"""
# Identify column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
# Build numeric pipeline
numeric_steps = []
# Outlier treatment
if self.outlier_treatment == 'iqr':
numeric_steps.append(('outlier_treatment', OutlierTreatment()))
# Imputation
if self.numeric_strategy == 'knn':
numeric_steps.append(('imputer', KNNImputer()))
else:
numeric_steps.append(('imputer', SimpleImputer(strategy=self.numeric_strategy)))
# Scaling
if self.scaling_strategy == 'standard':
numeric_steps.append(('scaler', StandardScaler()))
elif self.scaling_strategy == 'minmax':
from sklearn.preprocessing import MinMaxScaler
numeric_steps.append(('scaler', MinMaxScaler()))
self.numeric_pipeline = Pipeline(numeric_steps)
# Build categorical pipeline
categorical_steps = []
# Imputation
categorical_steps.append(('imputer',
SimpleImputer(strategy=self.categorical_strategy)))
# Encoding
if self.encoding_strategy == 'onehot':
categorical_steps.append(('encoder',
                OneHotEncoder(drop='first', sparse_output=False)))  # sparse_output replaces the deprecated 'sparse' argument
        elif self.encoding_strategy == 'label':
            # LabelEncoder only works on 1-D targets; OrdinalEncoder is the
            # feature-safe equivalent inside a Pipeline
            from sklearn.preprocessing import OrdinalEncoder
            categorical_steps.append(('encoder', OrdinalEncoder()))
self.categorical_pipeline = Pipeline(categorical_steps)
# Combine pipelines
self.preprocessor = ColumnTransformer([
('numeric', self.numeric_pipeline, numeric_features),
('categorical', self.categorical_pipeline, categorical_features)
], remainder='drop')
# Fit the preprocessor
self.preprocessor.fit(X, y)
return self
def transform(self, X):
"""Transform data using fitted pipeline"""
if self.preprocessor is None:
raise ValueError("Preprocessor must be fitted before transforming")
return self.preprocessor.transform(X)
def fit_transform(self, X, y=None):
"""Fit and transform data"""
return self.fit(X, y).transform(X)
class OutlierTreatment(BaseEstimator, TransformerMixin):
def __init__(self, method='iqr', threshold=1.5):
self.method = method
self.threshold = threshold
self.bounds = {}
def fit(self, X, y=None):
"""Calculate outlier bounds for each column"""
for column in X.columns:
if self.method == 'iqr':
Q1 = X[column].quantile(0.25)
Q3 = X[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - self.threshold * IQR
upper_bound = Q3 + self.threshold * IQR
elif self.method == 'zscore':
mean = X[column].mean()
std = X[column].std()
lower_bound = mean - self.threshold * std
upper_bound = mean + self.threshold * std
self.bounds[column] = (lower_bound, upper_bound)
return self
def transform(self, X):
"""Apply outlier treatment"""
X_transformed = X.copy()
for column, (lower_bound, upper_bound) in self.bounds.items():
X_transformed[column] = np.clip(X_transformed[column],
lower_bound, upper_bound)
return X_transformed
class FeatureEngineer(BaseEstimator, TransformerMixin):
def __init__(self,
create_interactions=False,
create_polynomials=False,
polynomial_degree=2,
create_ratios=False,
create_aggregations=False):
self.create_interactions = create_interactions
self.create_polynomials = create_polynomials
self.polynomial_degree = polynomial_degree
self.create_ratios = create_ratios
self.create_aggregations = create_aggregations
self.feature_names = []
self.numeric_columns = []
def fit(self, X, y=None):
"""Fit feature engineer"""
self.numeric_columns = X.select_dtypes(include=[np.number]).columns.tolist()
self.feature_names = X.columns.tolist()
return self
def transform(self, X):
"""Create engineered features"""
X_transformed = X.copy()
# Interaction features
if self.create_interactions and len(self.numeric_columns) >= 2:
for i, col1 in enumerate(self.numeric_columns):
for col2 in self.numeric_columns[i+1:]:
interaction_name = f"{col1}_{col2}_interaction"
X_transformed[interaction_name] = X[col1] * X[col2]
# Polynomial features
if self.create_polynomials:
for col in self.numeric_columns:
for degree in range(2, self.polynomial_degree + 1):
poly_name = f"{col}_poly_{degree}"
X_transformed[poly_name] = X[col] ** degree
# Ratio features
if self.create_ratios and len(self.numeric_columns) >= 2:
for i, col1 in enumerate(self.numeric_columns):
for col2 in self.numeric_columns[i+1:]:
ratio_name = f"{col1}_{col2}_ratio"
# Avoid division by zero
X_transformed[ratio_name] = X[col1] / (X[col2] + 1e-8)
        return X_transformed

The modeling phase involves selecting appropriate algorithms, training models, and optimizing performance.
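Before moving on to algorithm selection, here is a minimal usage sketch that wires the preprocessing components above together. The DataFrame contents and column names are illustrative assumptions, not part of the case studies that follow.

import numpy as np
import pandas as pd

# Illustrative data; in a real project this comes from your data source
df = pd.DataFrame({
    'income': np.random.exponential(50000, 500),
    'age': np.random.randint(18, 80, 500).astype('int64'),
    'segment': np.random.choice(['A', 'B', 'C'], 500),
    'target': np.random.choice([0, 1], 500),
})

X, y = df.drop(columns='target'), df['target']

preprocessor = DataPreprocessor(numeric_strategy='median',
                                encoding_strategy='onehot',
                                scaling_strategy='standard')
X_ready = preprocessor.fit_transform(X, y)
print(X_ready.shape)  # 500 rows x (2 scaled numerics + 2 one-hot columns)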
Model Selection Framework:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb
import lightgbm as lgb
from scipy.stats import randint, uniform
class ModelSelector:
def __init__(self, problem_type='classification', cv_folds=5, scoring=None):
self.problem_type = problem_type
self.cv_folds = cv_folds
self.scoring = scoring or self._get_default_scoring()
self.models = self._get_base_models()
self.results = {}
self.best_model = None
self.best_params = None
def _get_default_scoring(self):
"""Get default scoring metric based on problem type"""
if self.problem_type == 'classification':
return 'roc_auc'
elif self.problem_type == 'regression':
return 'r2'
else:
return 'accuracy'
def _get_base_models(self):
"""Get base models for comparison"""
if self.problem_type == 'classification':
return {
'logistic_regression': LogisticRegression(random_state=42),
'random_forest': RandomForestClassifier(random_state=42),
'gradient_boosting': GradientBoostingClassifier(random_state=42),
'svm': SVC(random_state=42, probability=True),
'naive_bayes': GaussianNB(),
'knn': KNeighborsClassifier(),
'xgboost': xgb.XGBClassifier(random_state=42),
'lightgbm': lgb.LGBMClassifier(random_state=42, verbosity=-1)
}
else:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
return {
'linear_regression': LinearRegression(),
'ridge_regression': Ridge(random_state=42),
'lasso_regression': Lasso(random_state=42),
'random_forest': RandomForestRegressor(random_state=42),
'gradient_boosting': GradientBoostingRegressor(random_state=42),
'svm': SVR(),
'xgboost': xgb.XGBRegressor(random_state=42),
'lightgbm': lgb.LGBMRegressor(random_state=42, verbosity=-1)
}
def compare_models(self, X, y):
"""Compare multiple models using cross-validation"""
print(f"Comparing models using {self.cv_folds}-fold cross-validation...")
print(f"Scoring metric: {self.scoring}")
print("-" * 60)
for name, model in self.models.items():
try:
scores = cross_val_score(model, X, y,
cv=self.cv_folds,
scoring=self.scoring,
n_jobs=-1)
self.results[name] = {
'scores': scores,
'mean_score': scores.mean(),
'std_score': scores.std(),
'model': model
}
print(f"{name:20s}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
except Exception as e:
print(f"{name:20s}: Error - {str(e)}")
# Sort by mean score (descending)
sorted_results = sorted(self.results.items(),
key=lambda x: x[1]['mean_score'],
reverse=True)
print("-" * 60)
print(f"Best model: {sorted_results[0][0]} with score: {sorted_results[0][1]['mean_score']:.4f}")
return sorted_results
def hyperparameter_tuning(self, X, y, model_name=None, search_type='random'):
"""Perform hyperparameter tuning for the best model or specified model"""
if model_name is None:
# Use the best model from comparison
if not self.results:
raise ValueError("No models have been compared yet. Run compare_models first.")
model_name = max(self.results.keys(), key=lambda k: self.results[k]['mean_score'])
model = self.models[model_name]
param_grid = self._get_param_grid(model_name)
print(f"Performing hyperparameter tuning for {model_name}...")
if search_type == 'grid':
search = GridSearchCV(model, param_grid,
cv=self.cv_folds,
scoring=self.scoring,
n_jobs=-1,
verbose=1)
else: # randomized search
search = RandomizedSearchCV(model, param_grid,
n_iter=50,
cv=self.cv_folds,
scoring=self.scoring,
n_jobs=-1,
verbose=1,
random_state=42)
search.fit(X, y)
self.best_model = search.best_estimator_
self.best_params = search.best_params_
print(f"Best parameters: {self.best_params}")
print(f"Best cross-validation score: {search.best_score_:.4f}")
return search
def _get_param_grid(self, model_name):
"""Get parameter grid for hyperparameter tuning"""
param_grids = {
'random_forest': {
'n_estimators': randint(50, 200),
'max_depth': [None, 10, 20, 30],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
                'max_features': ['sqrt', 'log2', None]  # 'auto' was removed in recent scikit-learn
},
'gradient_boosting': {
'n_estimators': randint(50, 200),
'learning_rate': uniform(0.01, 0.2),
'max_depth': randint(3, 10),
'subsample': uniform(0.6, 0.4)
},
'xgboost': {
'n_estimators': randint(50, 200),
'learning_rate': uniform(0.01, 0.2),
'max_depth': randint(3, 10),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.6, 0.4)
},
'lightgbm': {
'n_estimators': randint(50, 200),
'learning_rate': uniform(0.01, 0.2),
'max_depth': randint(3, 10),
'num_leaves': randint(10, 100),
'subsample': uniform(0.6, 0.4)
},
'logistic_regression': {
'C': uniform(0.001, 10),
'penalty': ['l1', 'l2'],
'solver': ['liblinear', 'saga']
},
'svm': {
'C': uniform(0.1, 10),
                'gamma': ['scale', 'auto'],  # mixing a scipy distribution into a choice list is invalid
'kernel': ['rbf', 'poly', 'sigmoid']
}
}
return param_grids.get(model_name, {})
class ModelEvaluator:
def __init__(self, problem_type='classification'):
self.problem_type = problem_type
self.evaluation_results = {}
def comprehensive_evaluation(self, model, X_test, y_test, X_train=None, y_train=None):
"""Perform comprehensive model evaluation"""
results = {}
# Make predictions
y_pred = model.predict(X_test)
if self.problem_type == 'classification':
y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
results = self._evaluate_classification(y_test, y_pred, y_pred_proba)
else:
results = self._evaluate_regression(y_test, y_pred)
# Add training performance if training data provided
if X_train is not None and y_train is not None:
y_train_pred = model.predict(X_train)
results['training_performance'] = self._calculate_training_metrics(y_train, y_train_pred)
results['overfitting_analysis'] = self._analyze_overfitting(results, y_train, y_train_pred)
self.evaluation_results = results
return results
def _evaluate_classification(self, y_true, y_pred, y_pred_proba=None):
"""Evaluate classification model"""
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report,
confusion_matrix, roc_curve, precision_recall_curve)
results = {
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred, average='weighted'),
'recall': recall_score(y_true, y_pred, average='weighted'),
'f1_score': f1_score(y_true, y_pred, average='weighted'),
'confusion_matrix': confusion_matrix(y_true, y_pred),
'classification_report': classification_report(y_true, y_pred, output_dict=True)
}
if y_pred_proba is not None:
results['roc_auc'] = roc_auc_score(y_true, y_pred_proba)
results['roc_curve'] = roc_curve(y_true, y_pred_proba)
results['precision_recall_curve'] = precision_recall_curve(y_true, y_pred_proba)
return results
def _evaluate_regression(self, y_true, y_pred):
"""Evaluate regression model"""
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
r2_score, explained_variance_score)
results = {
'mse': mean_squared_error(y_true, y_pred),
'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
'mae': mean_absolute_error(y_true, y_pred),
'r2_score': r2_score(y_true, y_pred),
'explained_variance': explained_variance_score(y_true, y_pred)
}
# Add residual analysis
residuals = y_true - y_pred
results['residual_analysis'] = {
'mean_residual': np.mean(residuals),
'std_residual': np.std(residuals),
'residuals': residuals
}
        return results

    def _calculate_training_metrics(self, y_train, y_train_pred):
        """Training-set metrics used to gauge overfitting (minimal implementation)"""
        if self.problem_type == 'classification':
            from sklearn.metrics import accuracy_score
            return {'accuracy': accuracy_score(y_train, y_train_pred)}
        from sklearn.metrics import r2_score
        return {'r2_score': r2_score(y_train, y_train_pred)}

    def _analyze_overfitting(self, results, y_train, y_train_pred):
        """Flag overfitting via the train-test gap on the primary metric"""
        metric = 'accuracy' if self.problem_type == 'classification' else 'r2_score'
        train_score = results['training_performance'].get(metric)
        test_score = results.get(metric)
        if train_score is None or test_score is None:
            return {}
        gap = train_score - test_score
        return {'train_test_gap': gap, 'likely_overfitting': gap > 0.05}
def visualize_performance(self, model, X_test, y_test):
"""Create performance visualizations"""
if self.problem_type == 'classification':
return self._plot_classification_results(model, X_test, y_test)
else:
return self._plot_regression_results(model, X_test, y_test)
def _plot_classification_results(self, model, X_test, y_test):
"""Create classification performance plots"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', ax=axes[0,0])
axes[0,0].set_title('Confusion Matrix')
axes[0,0].set_ylabel('True Label')
axes[0,0].set_xlabel('Predicted Label')
# ROC Curve
if y_pred_proba is not None:
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
axes[0,1].plot(fpr, tpr, color='darkorange', lw=2,
label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[0,1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0,1].set_xlim([0.0, 1.0])
axes[0,1].set_ylim([0.0, 1.05])
axes[0,1].set_xlabel('False Positive Rate')
axes[0,1].set_ylabel('True Positive Rate')
axes[0,1].set_title('ROC Curve')
axes[0,1].legend(loc="lower right")
# Precision-Recall Curve
if y_pred_proba is not None:
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)
axes[1,0].plot(recall, precision, color='blue', lw=2,
label=f'PR curve (AP = {avg_precision:.2f})')
axes[1,0].set_xlabel('Recall')
axes[1,0].set_ylabel('Precision')
axes[1,0].set_title('Precision-Recall Curve')
axes[1,0].legend()
# Feature Importance (if available)
if hasattr(model, 'feature_importances_'):
feature_names = [f'Feature_{i}' for i in range(len(model.feature_importances_))]
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=True).tail(10)
axes[1,1].barh(importance_df['feature'], importance_df['importance'])
axes[1,1].set_title('Top 10 Feature Importances')
axes[1,1].set_xlabel('Importance')
plt.tight_layout()
return fig
def _plot_regression_results(self, model, X_test, y_test):
"""Create regression performance plots"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
y_pred = model.predict(X_test)
residuals = y_test - y_pred
# Actual vs Predicted
axes[0,0].scatter(y_test, y_pred, alpha=0.5)
axes[0,0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0,0].set_xlabel('Actual Values')
axes[0,0].set_ylabel('Predicted Values')
axes[0,0].set_title('Actual vs Predicted Values')
# Residuals vs Predicted
axes[0,1].scatter(y_pred, residuals, alpha=0.5)
axes[0,1].axhline(y=0, color='r', linestyle='--')
axes[0,1].set_xlabel('Predicted Values')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Residuals vs Predicted Values')
# Residuals Distribution
axes[1,0].hist(residuals, bins=30, alpha=0.7)
axes[1,0].set_xlabel('Residuals')
axes[1,0].set_ylabel('Frequency')
axes[1,0].set_title('Distribution of Residuals')
# Q-Q Plot
from scipy.stats import probplot
probplot(residuals, dist="norm", plot=axes[1,1])
axes[1,1].set_title('Q-Q Plot of Residuals')
plt.tight_layout()
        return fig

Model evaluation goes beyond simple metrics to assess business value and deployment readiness.
Comprehensive Evaluation Framework:
class BusinessValueEvaluator:
def __init__(self, business_metrics):
self.business_metrics = business_metrics
self.cost_benefit_analysis = {}
def calculate_business_impact(self, model_results, baseline_results=None):
"""Calculate business impact of the model"""
impact_analysis = {}
# Revenue Impact
if 'revenue_per_tp' in self.business_metrics:
tp = model_results.get('true_positives', 0)
revenue_impact = tp * self.business_metrics['revenue_per_tp']
impact_analysis['revenue_increase'] = revenue_impact
# Cost Savings
if 'cost_per_fp' in self.business_metrics:
fp = model_results.get('false_positives', 0)
cost_savings = fp * self.business_metrics['cost_per_fp']
impact_analysis['cost_reduction'] = cost_savings
# Efficiency Gains
if 'time_savings_per_prediction' in self.business_metrics:
total_predictions = model_results.get('total_predictions', 0)
time_savings = total_predictions * self.business_metrics['time_savings_per_prediction']
impact_analysis['time_savings_hours'] = time_savings
# ROI Calculation
if baseline_results:
improvement = self._calculate_improvement(model_results, baseline_results)
impact_analysis['performance_improvement'] = improvement
return impact_analysis
def deployment_readiness_check(self, model, evaluation_results):
"""Assess if model is ready for deployment"""
readiness_checklist = {
'performance_threshold': self._check_performance_threshold(evaluation_results),
'bias_fairness': self._check_bias_fairness(evaluation_results),
'robustness': self._check_robustness(model, evaluation_results),
'interpretability': self._check_interpretability(model),
'scalability': self._check_scalability(model),
'monitoring_setup': self._check_monitoring_setup()
}
# Calculate overall readiness score
readiness_score = sum(readiness_checklist.values()) / len(readiness_checklist)
return {
'readiness_score': readiness_score,
'checklist': readiness_checklist,
'recommendations': self._get_deployment_recommendations(readiness_checklist)
}
def _check_performance_threshold(self, results):
"""Check if model meets minimum performance requirements"""
# Define minimum thresholds (these should be business-specific)
min_thresholds = {
'accuracy': 0.80,
'precision': 0.75,
'recall': 0.70,
'f1_score': 0.75,
'roc_auc': 0.80
}
for metric, threshold in min_thresholds.items():
if metric in results and results[metric] < threshold:
return False
        return True

    # The remaining checks are illustrative placeholders; substitute
    # project-specific logic (fairness audits, stress tests, monitoring hooks).
    def _check_bias_fairness(self, results):
        return True

    def _check_robustness(self, model, results):
        return True

    def _check_interpretability(self, model):
        return hasattr(model, 'feature_importances_') or hasattr(model, 'coef_')

    def _check_scalability(self, model):
        return True

    def _check_monitoring_setup(self):
        return False  # assume monitoring is not yet configured

    def _get_deployment_recommendations(self, checklist):
        return [f"Address failing check: {name}"
                for name, passed in checklist.items() if not passed]

    def _calculate_improvement(self, model_results, baseline_results):
        """Absolute metric deltas on keys shared by both result dicts"""
        return {m: model_results[m] - baseline_results[m]
                for m in model_results
                if m in baseline_results
                and isinstance(model_results[m], (int, float))}
def generate_model_card(self, model, evaluation_results, training_details):
"""Generate a comprehensive model card for documentation"""
model_card = {
'model_details': {
'model_type': type(model).__name__,
'model_version': training_details.get('version', '1.0'),
'training_date': training_details.get('training_date'),
'developer': training_details.get('developer'),
'intended_use': training_details.get('intended_use')
},
'performance_metrics': evaluation_results,
'training_data': {
'dataset_description': training_details.get('dataset_description'),
'data_size': training_details.get('data_size'),
'data_preprocessing': training_details.get('preprocessing_steps')
},
'evaluation_data': {
'test_size': training_details.get('test_size'),
'validation_strategy': training_details.get('validation_strategy')
},
'ethical_considerations': {
'bias_analysis': training_details.get('bias_analysis'),
'fairness_metrics': training_details.get('fairness_metrics'),
'limitations': training_details.get('limitations')
},
'deployment_considerations': {
'infrastructure_requirements': training_details.get('infrastructure_requirements'),
'monitoring_strategy': training_details.get('monitoring_strategy'),
'update_frequency': training_details.get('update_frequency')
}
}
        return model_card

Deployment transforms a model from a research artifact into a production system.
MLOps Pipeline Implementation:
import joblib
import json
from datetime import datetime
import logging
from pathlib import Path
class ModelDeploymentPipeline:
def __init__(self, model_name, version="1.0"):
self.model_name = model_name
self.version = version
self.deployment_date = datetime.now()
self.model_registry = {}
# Setup logging
logging.basicConfig(level=logging.INFO)
self.logger = logging.getLogger(__name__)
def package_model(self, model, preprocessor, feature_names, model_metadata):
"""Package model with all necessary components"""
model_package = {
'model': model,
'preprocessor': preprocessor,
'feature_names': feature_names,
'metadata': {
**model_metadata,
'model_name': self.model_name,
'version': self.version,
'deployment_date': self.deployment_date.isoformat(),
'model_type': type(model).__name__
}
}
return model_package
def save_model_artifacts(self, model_package, artifacts_dir="model_artifacts"):
"""Save all model artifacts to disk"""
artifacts_path = Path(artifacts_dir) / self.model_name / self.version
artifacts_path.mkdir(parents=True, exist_ok=True)
# Save model
model_path = artifacts_path / "model.joblib"
joblib.dump(model_package['model'], model_path)
# Save preprocessor
preprocessor_path = artifacts_path / "preprocessor.joblib"
joblib.dump(model_package['preprocessor'], preprocessor_path)
# Save metadata
metadata_path = artifacts_path / "metadata.json"
with open(metadata_path, 'w') as f:
json.dump(model_package['metadata'], f, indent=2, default=str)
# Save feature names
features_path = artifacts_path / "feature_names.json"
with open(features_path, 'w') as f:
json.dump(model_package['feature_names'], f, indent=2)
self.logger.info(f"Model artifacts saved to {artifacts_path}")
return artifacts_path
def create_prediction_api(self, model_package):
"""Create a simple Flask API for model predictions"""
flask_code = f'''
from flask import Flask, request, jsonify
import joblib
import pandas as pd
import numpy as np
import json
from pathlib import Path
app = Flask(__name__)
# Load model artifacts
MODEL_DIR = Path("model_artifacts/{self.model_name}/{self.version}")
model = joblib.load(MODEL_DIR / "model.joblib")
preprocessor = joblib.load(MODEL_DIR / "preprocessor.joblib")
with open(MODEL_DIR / "feature_names.json", 'r') as f:
feature_names = json.load(f)
with open(MODEL_DIR / "metadata.json", 'r') as f:
metadata = json.load(f)
@app.route('/predict', methods=['POST'])
def predict():
try:
# Get input data
data = request.json
# Convert to DataFrame
if isinstance(data, list):
df = pd.DataFrame(data)
else:
df = pd.DataFrame([data])
# Ensure all required features are present
missing_features = set(feature_names) - set(df.columns)
if missing_features:
return jsonify({{'error': f'Missing features: {{missing_features}}'}})
# Preprocess data
X_processed = preprocessor.transform(df[feature_names])
# Make predictions
predictions = model.predict(X_processed)
# Get prediction probabilities if available
if hasattr(model, 'predict_proba'):
probabilities = model.predict_proba(X_processed)
response = {{
'predictions': predictions.tolist(),
'probabilities': probabilities.tolist()
}}
else:
response = {{'predictions': predictions.tolist()}}
return jsonify(response)
except Exception as e:
return jsonify({{'error': str(e)}}), 400
@app.route('/model_info', methods=['GET'])
def model_info():
return jsonify(metadata)
@app.route('/health', methods=['GET'])
def health():
return jsonify({{'status': 'healthy', 'model': '{self.model_name}', 'version': '{self.version}'}})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=False)
'''
api_path = Path(f"api_{self.model_name}_{self.version}.py")
with open(api_path, 'w') as f:
f.write(flask_code)
self.logger.info(f"API code generated: {api_path}")
return api_path
class ModelMonitor:
def __init__(self, model_name, monitoring_config):
self.model_name = model_name
self.monitoring_config = monitoring_config
self.metrics_history = []
def log_prediction(self, input_data, prediction, actual=None, timestamp=None):
"""Log a prediction for monitoring"""
if timestamp is None:
timestamp = datetime.now()
log_entry = {
'timestamp': timestamp,
'input_data': input_data,
'prediction': prediction,
'actual': actual
}
self.metrics_history.append(log_entry)
def detect_drift(self, current_data, reference_data, method='ks_test'):
"""Detect data drift between current and reference datasets"""
from scipy.stats import ks_2samp
drift_results = {}
for column in current_data.columns:
if column in reference_data.columns:
if method == 'ks_test':
statistic, p_value = ks_2samp(
reference_data[column].dropna(),
current_data[column].dropna()
)
drift_results[column] = {
'statistic': statistic,
'p_value': p_value,
'drift_detected': p_value < 0.05
}
return drift_results
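    # Illustrative usage of the drift check (the object names below are
    # assumptions, not part of the framework above):
    #
    #   monitor = ModelMonitor('churn_model', monitoring_config={})
    #   drift = monitor.detect_drift(current_week_df, training_reference_df)
    #   drifted_columns = [col for col, r in drift.items() if r['drift_detected']]
    #
    # Columns whose KS test yields p < 0.05 are flagged as drifting and
    # usually warrant retraining or at least closer monitoring.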
def calculate_performance_metrics(self, predictions, actuals):
"""Calculate performance metrics for monitoring"""
if len(predictions) != len(actuals):
raise ValueError("Predictions and actuals must have the same length")
# This would be customized based on problem type
from sklearn.metrics import accuracy_score, precision_score, recall_score
metrics = {
'accuracy': accuracy_score(actuals, predictions),
'precision': precision_score(actuals, predictions, average='weighted'),
'recall': recall_score(actuals, predictions, average='weighted'),
'sample_count': len(predictions),
'timestamp': datetime.now().isoformat()
}
        return metrics

Scenario: A telecommunications company wants to predict which customers are likely to churn (cancel their service) in the next month to enable proactive retention efforts.
Business Objectives:
- Reduce customer churn by 15%
- Increase customer lifetime value
- Optimize retention campaign targeting
- Achieve model accuracy > 85%
# Project setup and data loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore')
class ChurnPredictionProject:
def __init__(self):
self.data = None
self.X_train = None
self.X_test = None
self.y_train = None
self.y_test = None
self.model = None
self.preprocessor = None
def load_and_explore_data(self):
"""Load and perform initial data exploration"""
# Generate synthetic telecom churn dataset
np.random.seed(42)
n_customers = 10000
# Customer demographics
customer_data = {
'customer_id': range(1, n_customers + 1),
'age': np.random.normal(45, 15, n_customers).astype(int),
'gender': np.random.choice(['M', 'F'], n_customers),
'tenure_months': np.random.exponential(24, n_customers).astype(int),
'monthly_charges': np.random.normal(70, 20, n_customers),
'total_charges': np.random.exponential(2000, n_customers),
'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'],
n_customers, p=[0.5, 0.3, 0.2]),
'payment_method': np.random.choice(['Credit card', 'Bank transfer', 'Electronic check', 'Mailed check'],
n_customers, p=[0.3, 0.2, 0.3, 0.2]),
'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'],
n_customers, p=[0.4, 0.4, 0.2]),
'phone_service': np.random.choice(['Yes', 'No'], n_customers, p=[0.9, 0.1]),
'multiple_lines': np.random.choice(['Yes', 'No'], n_customers, p=[0.5, 0.5]),
'online_security': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
'tech_support': np.random.choice(['Yes', 'No'], n_customers, p=[0.3, 0.7]),
'streaming_tv': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
'streaming_movies': np.random.choice(['Yes', 'No'], n_customers, p=[0.4, 0.6]),
'paperless_billing': np.random.choice(['Yes', 'No'], n_customers, p=[0.6, 0.4]),
'senior_citizen': np.random.choice([0, 1], n_customers, p=[0.84, 0.16])
}
self.data = pd.DataFrame(customer_data)
# Create churn target with realistic relationships
churn_prob = 0.1 # Base churn probability
# Adjust probability based on features
prob_adjustments = np.zeros(n_customers)
# Month-to-month contracts have higher churn
prob_adjustments += np.where(self.data['contract_type'] == 'Month-to-month', 0.15, 0)
# High monthly charges increase churn probability
prob_adjustments += np.where(self.data['monthly_charges'] > 80, 0.1, 0)
# Low tenure increases churn probability
prob_adjustments += np.where(self.data['tenure_months'] < 12, 0.12, 0)
# Electronic check payment method increases churn
prob_adjustments += np.where(self.data['payment_method'] == 'Electronic check', 0.08, 0)
# No online security increases churn
prob_adjustments += np.where(self.data['online_security'] == 'No', 0.05, 0)
# Senior citizens have higher churn
prob_adjustments += np.where(self.data['senior_citizen'] == 1, 0.06, 0)
final_churn_prob = np.clip(churn_prob + prob_adjustments, 0, 1)
self.data['churn'] = np.random.binomial(1, final_churn_prob)
print("Dataset created successfully!")
print(f"Shape: {self.data.shape}")
print(f"Churn rate: {self.data['churn'].mean():.1%}")
return self.data
def perform_eda(self):
"""Perform comprehensive exploratory data analysis"""
print("=== EXPLORATORY DATA ANALYSIS ===")
# Basic statistics
print("\n1. Basic Dataset Information:")
print(f" - Dataset shape: {self.data.shape}")
print(f" - Missing values: {self.data.isnull().sum().sum()}")
print(f" - Duplicate rows: {self.data.duplicated().sum()}")
print(f" - Churn rate: {self.data['churn'].mean():.1%}")
# Target variable distribution
print("\n2. Target Variable Distribution:")
churn_counts = self.data['churn'].value_counts()
print(f" - No Churn (0): {churn_counts[0]:,} ({churn_counts[0]/len(self.data):.1%})")
print(f" - Churn (1): {churn_counts[1]:,} ({churn_counts[1]/len(self.data):.1%})")
# Feature analysis
print("\n3. Feature Analysis:")
# Numerical features
numerical_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges']
print("\n Numerical Features Summary:")
for feature in numerical_features:
print(f" - {feature}:")
print(f" Mean: {self.data[feature].mean():.2f}")
print(f" Std: {self.data[feature].std():.2f}")
print(f" Min: {self.data[feature].min():.2f}")
print(f" Max: {self.data[feature].max():.2f}")
# Categorical features
categorical_features = [col for col in self.data.columns
if col not in numerical_features + ['customer_id', 'churn']]
print("\n Categorical Features Summary:")
for feature in categorical_features:
unique_values = self.data[feature].nunique()
print(f" - {feature}: {unique_values} unique values")
top_values = self.data[feature].value_counts().head(3)
print(f" Top values: {dict(top_values)}")
# Correlation with target
print("\n4. Feature-Target Relationships:")
# Numerical features correlation
for feature in numerical_features:
correlation = self.data[feature].corr(self.data['churn'])
print(f" - {feature} correlation with churn: {correlation:.3f}")
# Categorical features churn rates
print("\n Churn rates by categorical features:")
for feature in categorical_features:
churn_by_category = self.data.groupby(feature)['churn'].mean().sort_values(ascending=False)
print(f" - {feature}:")
for category, rate in churn_by_category.items():
print(f" {category}: {rate:.1%}")
return self._create_eda_visualizations()
def _create_eda_visualizations(self):
"""Create comprehensive EDA visualizations"""
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
# 1. Churn distribution
churn_counts = self.data['churn'].value_counts()
axes[0,0].pie(churn_counts.values, labels=['No Churn', 'Churn'], autopct='%1.1f%%')
axes[0,0].set_title('Overall Churn Distribution')
# 2. Age distribution by churn
self.data.boxplot(column='age', by='churn', ax=axes[0,1])
axes[0,1].set_title('Age Distribution by Churn Status')
# 3. Tenure vs Churn
self.data.boxplot(column='tenure_months', by='churn', ax=axes[0,2])
axes[0,2].set_title('Tenure Distribution by Churn Status')
# 4. Monthly charges vs Churn
self.data.boxplot(column='monthly_charges', by='churn', ax=axes[1,0])
axes[1,0].set_title('Monthly Charges by Churn Status')
# 5. Contract type vs Churn
contract_churn = pd.crosstab(self.data['contract_type'], self.data['churn'], normalize='index')
contract_churn.plot(kind='bar', ax=axes[1,1])
axes[1,1].set_title('Churn Rate by Contract Type')
axes[1,1].legend(['No Churn', 'Churn'])
# 6. Payment method vs Churn
payment_churn = pd.crosstab(self.data['payment_method'], self.data['churn'], normalize='index')
payment_churn.plot(kind='bar', ax=axes[1,2])
axes[1,2].set_title('Churn Rate by Payment Method')
axes[1,2].legend(['No Churn', 'Churn'])
# 7. Internet service vs Churn
internet_churn = pd.crosstab(self.data['internet_service'], self.data['churn'], normalize='index')
internet_churn.plot(kind='bar', ax=axes[2,0])
axes[2,0].set_title('Churn Rate by Internet Service')
axes[2,0].legend(['No Churn', 'Churn'])
# 8. Correlation heatmap
numerical_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges', 'senior_citizen', 'churn']
corr_matrix = self.data[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[2,1])
axes[2,1].set_title('Correlation Matrix')
# 9. Feature importance preview (tenure vs monthly charges colored by churn)
churn_0 = self.data[self.data['churn'] == 0]
churn_1 = self.data[self.data['churn'] == 1]
axes[2,2].scatter(churn_0['tenure_months'], churn_0['monthly_charges'],
alpha=0.5, label='No Churn', c='blue')
axes[2,2].scatter(churn_1['tenure_months'], churn_1['monthly_charges'],
alpha=0.5, label='Churn', c='red')
axes[2,2].set_xlabel('Tenure (months)')
axes[2,2].set_ylabel('Monthly Charges')
axes[2,2].set_title('Tenure vs Monthly Charges by Churn')
axes[2,2].legend()
plt.tight_layout()
return fig
def preprocess_data(self):
"""Comprehensive data preprocessing"""
print("=== DATA PREPROCESSING ===")
# Create a copy for preprocessing
df_processed = self.data.copy()
# 1. Handle missing values (if any)
print(f"Missing values before cleaning: {df_processed.isnull().sum().sum()}")
# 2. Feature engineering
print("\n1. Creating new features...")
# Customer value score
df_processed['customer_value_score'] = (
df_processed['tenure_months'] * df_processed['monthly_charges'] /
df_processed['monthly_charges'].max()
)
# Charges per month of tenure
df_processed['charges_per_tenure'] = df_processed['total_charges'] / (df_processed['tenure_months'] + 1)
# Service count
service_features = ['phone_service', 'multiple_lines', 'online_security',
'tech_support', 'streaming_tv', 'streaming_movies']
df_processed['total_services'] = sum([
(df_processed[feature] == 'Yes').astype(int) for feature in service_features
])
# High value customer flag
df_processed['high_value_customer'] = (
(df_processed['monthly_charges'] > df_processed['monthly_charges'].quantile(0.75)) &
(df_processed['tenure_months'] > 12)
).astype(int)
# Contract risk score
contract_risk = {'Month-to-month': 2, 'One year': 1, 'Two year': 0}
df_processed['contract_risk_score'] = df_processed['contract_type'].map(contract_risk)
print(f"New features created: customer_value_score, charges_per_tenure, total_services, high_value_customer, contract_risk_score")
# 3. Encode categorical variables
print("\n2. Encoding categorical variables...")
# Binary categorical variables
binary_features = ['gender', 'phone_service', 'multiple_lines', 'online_security',
'tech_support', 'streaming_tv', 'streaming_movies', 'paperless_billing']
label_encoders = {}
for feature in binary_features:
le = LabelEncoder()
df_processed[f'{feature}_encoded'] = le.fit_transform(df_processed[feature])
label_encoders[feature] = le
# One-hot encode multi-class categorical variables
multi_class_features = ['contract_type', 'payment_method', 'internet_service']
df_encoded = pd.get_dummies(df_processed, columns=multi_class_features, prefix=multi_class_features)
print(f"Encoded features: {binary_features + multi_class_features}")
# 4. Select final features for modeling
feature_columns = []
# Numerical features
numerical_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges',
'senior_citizen', 'customer_value_score', 'charges_per_tenure',
'total_services', 'high_value_customer', 'contract_risk_score']
feature_columns.extend(numerical_features)
# Encoded binary features
encoded_binary_features = [f'{feature}_encoded' for feature in binary_features]
feature_columns.extend(encoded_binary_features)
# One-hot encoded features
onehot_features = [col for col in df_encoded.columns
if any(col.startswith(prefix) for prefix in multi_class_features)]
feature_columns.extend(onehot_features)
# Prepare final dataset
X = df_encoded[feature_columns]
y = df_encoded['churn']
print(f"\nFinal feature set: {len(feature_columns)} features")
print(f"Feature names: {feature_columns}")
# 5. Split the data
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 6. Scale numerical features
print("\n3. Scaling features...")
scaler = StandardScaler()
# Identify numerical columns in the final feature set
numerical_cols_final = [col for col in numerical_features if col in X.columns]
self.X_train_scaled = self.X_train.copy()
self.X_test_scaled = self.X_test.copy()
self.X_train_scaled[numerical_cols_final] = scaler.fit_transform(self.X_train[numerical_cols_final])
self.X_test_scaled[numerical_cols_final] = scaler.transform(self.X_test[numerical_cols_final])
self.preprocessor = {
'scaler': scaler,
'label_encoders': label_encoders,
'feature_columns': feature_columns,
'numerical_columns': numerical_cols_final
}
print(f"Training set: {self.X_train_scaled.shape}")
print(f"Test set: {self.X_test_scaled.shape}")
print(f"Class distribution in training set:")
print(f" No Churn: {(self.y_train == 0).sum()} ({(self.y_train == 0).mean():.1%})")
print(f" Churn: {(self.y_train == 1).sum()} ({(self.y_train == 1).mean():.1%})")
return self.X_train_scaled, self.X_test_scaled, self.y_train, self.y_test
def build_and_evaluate_models(self):
"""Build and evaluate multiple models"""
print("=== MODEL BUILDING AND EVALUATION ===")
# Initialize model selector
model_selector = ModelSelector(problem_type='classification', cv_folds=5, scoring='roc_auc')
# Compare models
print("\n1. Comparing multiple models...")
model_results = model_selector.compare_models(self.X_train_scaled, self.y_train)
# Hyperparameter tuning for best model
print("\n2. Hyperparameter tuning...")
best_model_search = model_selector.hyperparameter_tuning(
self.X_train_scaled, self.y_train,
model_name=None, # Will use best from comparison
search_type='random'
)
self.model = model_selector.best_model
# Evaluate on test set
print("\n3. Final evaluation on test set...")
evaluator = ModelEvaluator(problem_type='classification')
evaluation_results = evaluator.comprehensive_evaluation(
self.model, self.X_test_scaled, self.y_test,
self.X_train_scaled, self.y_train
)
# Print key results
print("\nFinal Model Performance:")
print(f" Accuracy: {evaluation_results['accuracy']:.4f}")
print(f" Precision: {evaluation_results['precision']:.4f}")
print(f" Recall: {evaluation_results['recall']:.4f}")
print(f" F1-Score: {evaluation_results['f1_score']:.4f}")
print(f" ROC-AUC: {evaluation_results['roc_auc']:.4f}")
# Feature importance analysis
if hasattr(self.model, 'feature_importances_'):
feature_importance = pd.DataFrame({
'feature': self.preprocessor['feature_columns'],
'importance': self.model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
for idx, row in feature_importance.head(10).iterrows():
print(f" {row['feature']}: {row['importance']:.4f}")
return evaluation_results
def business_impact_analysis(self):
"""Analyze business impact of the model"""
print("=== BUSINESS IMPACT ANALYSIS ===")
# Business assumptions
business_metrics = {
'avg_customer_value': 1200, # Average annual customer value
'retention_campaign_cost': 50, # Cost per retention campaign
'campaign_success_rate': 0.3, # 30% of targeted customers retained
}
# Get test predictions
y_pred = self.model.predict(self.X_test_scaled)
y_pred_proba = self.model.predict_proba(self.X_test_scaled)[:, 1]
# Calculate confusion matrix components
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(self.y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
# Business impact calculations
print("1. Current Model Performance:")
print(f" - True Positives (Correctly identified churners): {tp}")
print(f" - False Positives (Incorrectly flagged as churners): {fp}")
print(f" - True Negatives (Correctly identified non-churners): {tn}")
print(f" - False Negatives (Missed churners): {fn}")
# Revenue impact
saved_customers = tp * business_metrics['campaign_success_rate']
revenue_saved = saved_customers * business_metrics['avg_customer_value']
campaign_costs = (tp + fp) * business_metrics['retention_campaign_cost']
net_benefit = revenue_saved - campaign_costs
print(f"\n2. Business Impact:")
print(f" - Customers targeted for retention: {tp + fp}")
print(f" - Estimated customers saved: {saved_customers:.0f}")
print(f" - Revenue saved: ${revenue_saved:,.0f}")
print(f" - Campaign costs: ${campaign_costs:,.0f}")
print(f" - Net benefit: ${net_benefit:,.0f}")
# ROI calculation
if campaign_costs > 0:
roi = (revenue_saved - campaign_costs) / campaign_costs * 100
print(f" - Return on Investment: {roi:.1f}%")
# Cost of missed opportunities
missed_revenue = fn * business_metrics['avg_customer_value']
print(f" - Revenue lost from missed churners: ${missed_revenue:,.0f}")
return {
'net_benefit': net_benefit,
'revenue_saved': revenue_saved,
'campaign_costs': campaign_costs,
'customers_saved': saved_customers,
'missed_revenue': missed_revenue
}
# Demonstrate the complete churn prediction project
def run_churn_prediction_project():
"""Run the complete churn prediction project"""
print("CUSTOMER CHURN PREDICTION PROJECT")
print("=" * 50)
# Initialize project
project = ChurnPredictionProject()
# Load and explore data
data = project.load_and_explore_data()
# Perform EDA
eda_fig = project.perform_eda()
# Preprocess data
X_train, X_test, y_train, y_test = project.preprocess_data()
# Build and evaluate models
evaluation_results = project.build_and_evaluate_models()
# Analyze business impact
business_impact = project.business_impact_analysis()
# Create performance visualizations
evaluator = ModelEvaluator(problem_type='classification')
performance_fig = evaluator.visualize_performance(project.model, project.X_test_scaled, project.y_test)
print("\n" + "=" * 50)
print("PROJECT COMPLETE - READY FOR DEPLOYMENT")
    return project, evaluation_results, business_impact

Scenario: A real estate company wants to build an automated valuation model (AVM) to estimate house prices for their online platform and assist real estate agents with pricing strategies.
Business Objectives:
- Predict house prices with MAPE < 10%
- Provide transparent price estimates to customers
- Identify key factors affecting house prices
- Support pricing strategy decisions
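MAPE (mean absolute percentage error) is the headline metric in these objectives and is not computed by the evaluation code above, so a minimal reference implementation is worth pinning down. The function name is ours, and the formula assumes strictly positive targets, which holds for house prices.

import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error: mean(|y_true - y_pred| / y_true) * 100"""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(f"{mape([200000, 350000], [190000, 380000]):.1f}%")  # 6.8%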
class HousePricePredictionProject:
def __init__(self):
self.data = None
self.X_train = None
self.X_test = None
self.y_train = None
self.y_test = None
self.model = None
self.preprocessor = None
def generate_realistic_housing_data(self, n_samples=5000):
"""Generate realistic housing dataset"""
np.random.seed(42)
# Location factors (simplified)
neighborhoods = ['Downtown', 'Suburbs', 'Waterfront', 'Industrial', 'Rural']
neighborhood_multipliers = [1.5, 1.2, 1.8, 0.8, 0.9]
# Generate base features
data = {}
# Basic property characteristics
data['square_feet'] = np.random.normal(2000, 600, n_samples).astype(int)
data['square_feet'] = np.clip(data['square_feet'], 500, 5000)
data['bedrooms'] = np.random.choice([1, 2, 3, 4, 5, 6], n_samples,
p=[0.05, 0.15, 0.35, 0.30, 0.12, 0.03])
data['bathrooms'] = np.random.normal(2.5, 1, n_samples)
data['bathrooms'] = np.clip(data['bathrooms'], 1, 5)
# Property features
data['age_years'] = np.random.exponential(20, n_samples).astype(int)
data['age_years'] = np.clip(data['age_years'], 0, 100)
data['garage_spaces'] = np.random.choice([0, 1, 2, 3], n_samples,
p=[0.1, 0.3, 0.5, 0.1])
data['lot_size_sqft'] = np.random.normal(8000, 3000, n_samples).astype(int)
data['lot_size_sqft'] = np.clip(data['lot_size_sqft'], 1000, 20000)
# Categorical features
data['neighborhood'] = np.random.choice(neighborhoods, n_samples)
data['property_type'] = np.random.choice(['Single Family', 'Condo', 'Townhouse'],
n_samples, p=[0.7, 0.2, 0.1])
data['heating_type'] = np.random.choice(['Gas', 'Electric', 'Oil'], n_samples,
p=[0.6, 0.3, 0.1])
# Quality ratings (1-10 scale)
data['overall_condition'] = np.random.choice(range(1, 11), n_samples,
p=[0.02, 0.03, 0.05, 0.1, 0.15,
0.25, 0.2, 0.12, 0.06, 0.02])
data['kitchen_quality'] = np.random.choice(range(1, 11), n_samples,
p=[0.05, 0.05, 0.1, 0.15, 0.2,
0.2, 0.15, 0.07, 0.02, 0.01])
# Binary features
data['has_pool'] = np.random.choice([0, 1], n_samples, p=[0.8, 0.2])
data['has_fireplace'] = np.random.choice([0, 1], n_samples, p=[0.6, 0.4])
data['has_basement'] = np.random.choice([0, 1], n_samples, p=[0.3, 0.7])
data['recently_renovated'] = np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
# Create DataFrame
df = pd.DataFrame(data)
# Calculate realistic price based on features
base_price = 100000 # Base price
# Square footage impact (most important factor)
        price = base_price + (df['square_feet'] * 120)
        # Neighborhood multiplier
        neighborhood_mult = df['neighborhood'].map(dict(zip(neighborhoods, neighborhood_multipliers)))
        price = price * neighborhood_mult
        # Bedrooms and bathrooms
        price += df['bedrooms'] * 15000
        price += df['bathrooms'] * 10000
        # Age depreciation
        price *= (1 - df['age_years'] * 0.005)
        # Quality factors
        price *= (0.7 + df['overall_condition'] * 0.03)
        price *= (0.9 + df['kitchen_quality'] * 0.01)
        # Property type adjustments
        type_multipliers = {'Single Family': 1.0, 'Condo': 0.85, 'Townhouse': 0.92}
        type_mult = df['property_type'].map(type_multipliers)
        price *= type_mult
        # Additional features
        price += df['garage_spaces'] * 8000
        price += df['lot_size_sqft'] * 2
        price += df['has_pool'] * 25000
        price += df['has_fireplace'] * 8000
        price += df['has_basement'] * 12000
        price += df['recently_renovated'] * 20000
        # Add some noise
        noise = np.random.normal(0, price * 0.1)
        price += noise
        # Ensure positive prices
        price = np.maximum(price, 50000)
df['price'] = price.astype(int)
self.data = df
return df
def perform_eda_regression(self):
"""Perform EDA for regression problem"""
print("=== HOUSING DATA EXPLORATORY ANALYSIS ===")
print(f"\n1. Dataset Overview:")
print(f" - Shape: {self.data.shape}")
print(f" - Missing values: {self.data.isnull().sum().sum()}")
print(f" - Price range: ${self.data['price'].min():,} - ${self.data['price'].max():,}")
print(f" - Median price: ${self.data['price'].median():,}")
print(f" - Mean price: ${self.data['price'].mean():,.0f}")
# Numerical features analysis
numerical_features = ['square_feet', 'bedrooms', 'bathrooms', 'age_years',
'garage_spaces', 'lot_size_sqft', 'overall_condition',
'kitchen_quality', 'price']
print(f"\n2. Numerical Features Summary:")
print(self.data[numerical_features].describe())
# Correlation with price
print(f"\n3. Correlation with Price:")
price_correlations = self.data[numerical_features].corr()['price'].sort_values(ascending=False)
for feature, corr in price_correlations.items():
if feature != 'price':
print(f" - {feature}: {corr:.3f}")
# Categorical features analysis
categorical_features = ['neighborhood', 'property_type', 'heating_type']
print(f"\n4. Price by Categories:")
for feature in categorical_features:
print(f"\n {feature}:")
price_by_category = self.data.groupby(feature)['price'].agg(['mean', 'median', 'count'])
for category in price_by_category.index:
mean_price = price_by_category.loc[category, 'mean']
median_price = price_by_category.loc[category, 'median']
count = price_by_category.loc[category, 'count']
print(f" {category}: Mean=${mean_price:,.0f}, Median=${median_price:,.0f}, Count={count}")
return self._create_regression_eda_plots()
def _create_regression_eda_plots(self):
"""Create EDA visualizations for regression"""
fig, axes = plt.subplots(3, 3, figsize=(20, 18))
# 1. Price distribution
axes[0,0].hist(self.data['price'], bins=50, alpha=0.7, edgecolor='black')
axes[0,0].set_title('House Price Distribution')
axes[0,0].set_xlabel('Price ($)')
axes[0,0].set_ylabel('Frequency')
# 2. Log price distribution (often more normal)
log_price = np.log(self.data['price'])
axes[0,1].hist(log_price, bins=50, alpha=0.7, edgecolor='black')
axes[0,1].set_title('Log Price Distribution')
axes[0,1].set_xlabel('Log(Price)')
axes[0,1].set_ylabel('Frequency')
# 3. Square feet vs Price
axes[0,2].scatter(self.data['square_feet'], self.data['price'], alpha=0.5)
axes[0,2].set_xlabel('Square Feet')
axes[0,2].set_ylabel('Price ($)')
axes[0,2].set_title('Square Feet vs Price')
# 4. Age vs Price
axes[1,0].scatter(self.data['age_years'], self.data['price'], alpha=0.5)
axes[1,0].set_xlabel('Age (Years)')
axes[1,0].set_ylabel('Price ($)')
axes[1,0].set_title('Age vs Price')
# 5. Bedrooms vs Price (box plot)
self.data.boxplot(column='price', by='bedrooms', ax=axes[1,1])
axes[1,1].set_title('Price Distribution by Bedrooms')
axes[1,1].set_xlabel('Number of Bedrooms')
# 6. Neighborhood vs Price
neighborhood_prices = self.data.groupby('neighborhood')['price'].mean().sort_values()
axes[1,2].bar(neighborhood_prices.index, neighborhood_prices.values)
axes[1,2].set_title('Average Price by Neighborhood')
axes[1,2].set_xlabel('Neighborhood')
axes[1,2].set_ylabel('Average Price ($)')
axes[1,2].tick_params(axis='x', rotation=45)
# 7. Overall condition vs Price
self.data.boxplot(column='price', by='overall_condition', ax=axes[2,0])
axes[2,0].set_title('Price by Overall Condition')
axes[2,0].set_xlabel('Overall Condition (1-10)')
# 8. Correlation heatmap
numerical_cols = ['square_feet', 'bedrooms', 'bathrooms', 'age_years',
'garage_spaces', 'overall_condition', 'kitchen_quality', 'price']
corr_matrix = self.data[numerical_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[2,1])
axes[2,1].set_title('Feature Correlation Matrix')
# 9. Price vs. Multiple features
# Create a composite score and plot against price
self.data['quality_score'] = (self.data['overall_condition'] + self.data['kitchen_quality']) / 2
axes[2,2].scatter(self.data['quality_score'], self.data['price'], alpha=0.5)
axes[2,2].set_xlabel('Average Quality Score')
axes[2,2].set_ylabel('Price ($)')
axes[2,2].set_title('Quality Score vs Price')
plt.tight_layout()
return fig
def preprocess_regression_data(self):
"""Preprocess data for regression modeling"""
print("=== REGRESSION DATA PREPROCESSING ===")
# Feature engineering for regression
df_processed = self.data.copy()
# 1. Create new features
print("\n1. Feature Engineering:")
# Price per square foot (for analysis, not modeling)
df_processed['price_per_sqft'] = df_processed['price'] / df_processed['square_feet']
# Total rooms
df_processed['total_rooms'] = df_processed['bedrooms'] + df_processed['bathrooms']
# Property age categories
df_processed['age_category'] = pd.cut(df_processed['age_years'],
bins=[0, 5, 15, 30, 100],
labels=['New', 'Recent', 'Mature', 'Old'],
include_lowest=True)  # include age 0 in the 'New' bin
# Size categories
df_processed['size_category'] = pd.cut(df_processed['square_feet'],
bins=[0, 1200, 2000, 3000, 10000],
labels=['Small', 'Medium', 'Large', 'Luxury'])
# Quality score
df_processed['quality_score'] = (df_processed['overall_condition'] + df_processed['kitchen_quality']) / 2
# Lot efficiency (house size relative to lot size)
df_processed['lot_efficiency'] = df_processed['square_feet'] / df_processed['lot_size_sqft']
print(f" Created features: total_rooms, age_category, size_category, quality_score, lot_efficiency")
# 2. Handle categorical variables
print("\n2. Encoding categorical variables...")
# One-hot encode categorical variables
categorical_features = ['neighborhood', 'property_type', 'heating_type', 'age_category', 'size_category']
df_encoded = pd.get_dummies(df_processed, columns=categorical_features, prefix=categorical_features)
# 3. Select features for modeling
# Exclude target and intermediate variables
exclude_features = ['price', 'price_per_sqft']
feature_columns = [col for col in df_encoded.columns if col not in exclude_features]
X = df_encoded[feature_columns]
y = df_encoded['price']
print(f"\nFeatures for modeling: {len(feature_columns)} features")
# 4. Train-test split
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 5. Feature scaling
print("\n3. Feature scaling...")
scaler = StandardScaler()
# Scale all features for regression
self.X_train_scaled = pd.DataFrame(
scaler.fit_transform(self.X_train),
columns=self.X_train.columns,
index=self.X_train.index
)
self.X_test_scaled = pd.DataFrame(
scaler.transform(self.X_test),
columns=self.X_test.columns,
index=self.X_test.index
)
self.preprocessor = {
'scaler': scaler,
'feature_columns': feature_columns
}
print(f"Training set: {self.X_train_scaled.shape}")
print(f"Test set: {self.X_test_scaled.shape}")
print(f"Target variable statistics:")
print(f" Training mean: ${self.y_train.mean():,.0f}")
print(f" Training std: ${self.y_train.std():,.0f}")
print(f" Training range: ${self.y_train.min():,.0f} - ${self.y_train.max():,.0f}")
return self.X_train_scaled, self.X_test_scaled, self.y_train, self.y_test
def build_regression_models(self):
"""Build and evaluate regression models"""
print("=== REGRESSION MODEL BUILDING ===")
# Initialize model selector for regression
model_selector = ModelSelector(problem_type='regression', cv_folds=5, scoring='r2')
# Compare models
print("\n1. Comparing regression models...")
model_results = model_selector.compare_models(self.X_train_scaled, self.y_train)
# Hyperparameter tuning
print("\n2. Hyperparameter tuning...")
best_model_search = model_selector.hyperparameter_tuning(
self.X_train_scaled, self.y_train,
model_name=None,
search_type='random'
)
self.model = model_selector.best_model
# Comprehensive evaluation
print("\n3. Model evaluation...")
evaluator = ModelEvaluator(problem_type='regression')
evaluation_results = evaluator.comprehensive_evaluation(
self.model, self.X_test_scaled, self.y_test,
self.X_train_scaled, self.y_train
)
# Calculate additional metrics
y_pred = self.model.predict(self.X_test_scaled)
# Mean Absolute Percentage Error
mape = np.mean(np.abs((self.y_test - y_pred) / self.y_test)) * 100
# Within percentage thresholds
errors = np.abs(self.y_test - y_pred) / self.y_test
within_5_pct = (errors <= 0.05).mean() * 100
within_10_pct = (errors <= 0.10).mean() * 100
within_20_pct = (errors <= 0.20).mean() * 100
print(f"\nRegression Model Performance:")
print(f" R² Score: {evaluation_results['r2_score']:.4f}")
print(f" RMSE: ${evaluation_results['rmse']:,.0f}")
print(f" MAE: ${evaluation_results['mae']:,.0f}")
print(f" MAPE: {mape:.2f}%")
print(f" Predictions within 5%: {within_5_pct:.1f}%")
print(f" Predictions within 10%: {within_10_pct:.1f}%")
print(f" Predictions within 20%: {within_20_pct:.1f}%")
evaluation_results['mape'] = mape
evaluation_results['within_5_pct'] = within_5_pct
evaluation_results['within_10_pct'] = within_10_pct
evaluation_results['within_20_pct'] = within_20_pct
return evaluation_results

Scenario: An e-commerce company wants to segment its customers for targeted marketing campaigns, personalized recommendations, and inventory planning.
class CustomerSegmentationProject:
def __init__(self):
self.data = None
self.customer_features = None
self.segmentation_model = None
self.segment_profiles = {}
def generate_ecommerce_data(self, n_customers=10000):
"""Generate realistic e-commerce customer data"""
np.random.seed(42)
# Customer demographics
data = {
'customer_id': range(1, n_customers + 1),
'age': np.random.normal(40, 15, n_customers).astype(int),
'registration_days': np.random.exponential(365, n_customers).astype(int)
}
# Clip age to reasonable range
data['age'] = np.clip(data['age'], 18, 80)
# Purchase behavior (with realistic correlations)
# Create different customer archetypes
customer_types = np.random.choice(['bargain_hunter', 'premium', 'occasional', 'frequent'],
n_customers, p=[0.3, 0.2, 0.3, 0.2])
# Initialize arrays
total_orders = np.zeros(n_customers)
total_spent = np.zeros(n_customers)
avg_order_value = np.zeros(n_customers)
days_since_last_order = np.zeros(n_customers)
for i in range(n_customers):
if customer_types[i] == 'bargain_hunter':
total_orders[i] = np.random.poisson(15)
avg_order_value[i] = np.random.normal(25, 8)
days_since_last_order[i] = np.random.exponential(30)
elif customer_types[i] == 'premium':
total_orders[i] = np.random.poisson(8)
avg_order_value[i] = np.random.normal(150, 50)
days_since_last_order[i] = np.random.exponential(45)
elif customer_types[i] == 'occasional':
total_orders[i] = np.random.poisson(3)
avg_order_value[i] = np.random.normal(60, 20)
days_since_last_order[i] = np.random.exponential(90)
else: # frequent
total_orders[i] = np.random.poisson(25)
avg_order_value[i] = np.random.normal(80, 25)
days_since_last_order[i] = np.random.exponential(15)
# Ensure positive values
avg_order_value = np.maximum(avg_order_value, 10)
total_spent = total_orders * avg_order_value
days_since_last_order = np.maximum(days_since_last_order, 1)
data.update({
'total_orders': total_orders.astype(int),
'total_spent': total_spent,
'avg_order_value': avg_order_value,
'days_since_last_order': days_since_last_order.astype(int)
})
# Category preferences
categories = ['electronics', 'clothing', 'home', 'books', 'sports']
for category in categories:
data[f'{category}_orders'] = np.random.poisson(data['total_orders'] * np.random.uniform(0.1, 0.4, n_customers))
# Engagement metrics
data['website_visits'] = np.random.poisson(data['total_orders'] * np.random.uniform(3, 8, n_customers))
data['email_opens'] = np.random.binomial(50, 0.3, n_customers) # Assumes 50 emails sent
data['social_media_clicks'] = np.random.poisson(5, n_customers)
# Channel preferences
data['mobile_orders'] = np.random.binomial(data['total_orders'], 0.6)
data['desktop_orders'] = data['total_orders'] - data['mobile_orders']
self.data = pd.DataFrame(data)
# Add true customer type for validation (normally wouldn't have this)
self.data['true_segment'] = customer_types
return self.data
def feature_engineering_unsupervised(self):
"""Create features for customer segmentation"""
print("=== FEATURE ENGINEERING FOR SEGMENTATION ===")
df = self.data.copy()
# 1. RFM Features (Recency, Frequency, Monetary)
df['recency'] = df['days_since_last_order']
df['frequency'] = df['total_orders']
df['monetary'] = df['total_spent']
# 2. Behavioral features
df['orders_per_day'] = df['total_orders'] / np.maximum(df['registration_days'], 1)
df['avg_days_between_orders'] = df['registration_days'] / np.maximum(df['total_orders'], 1)
# 3. Category diversity
category_columns = ['electronics_orders', 'clothing_orders', 'home_orders', 'books_orders', 'sports_orders']
df['category_diversity'] = (df[category_columns] > 0).sum(axis=1)
# 4. Engagement metrics
df['engagement_score'] = (
(df['website_visits'] / np.maximum(df['total_orders'], 1)) * 0.3 +
(df['email_opens'] / 50) * 0.4 + # Normalize by emails sent
(df['social_media_clicks'] / 10) * 0.3 # Normalize
)
# 5. Channel preference
df['mobile_preference'] = df['mobile_orders'] / np.maximum(df['total_orders'], 1)
# 6. Customer lifetime value approximation
df['customer_lifetime_value'] = df['avg_order_value'] * df['total_orders']
# Select features for clustering
clustering_features = [
'recency', 'frequency', 'monetary', 'avg_order_value',
'orders_per_day', 'category_diversity', 'engagement_score',
'mobile_preference', 'age'
]
self.customer_features = df[clustering_features]
print(f"Features for clustering: {clustering_features}")
print(f"Feature matrix shape: {self.customer_features.shape}")
# Display feature statistics
print(f"\nFeature Statistics:")
print(self.customer_features.describe())
return self.customer_features
def perform_customer_segmentation(self, n_clusters_range=range(2, 10)):
"""Perform customer segmentation using multiple methods"""
print("=== CUSTOMER SEGMENTATION ===")
# Scale features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(self.customer_features)
features_scaled_df = pd.DataFrame(features_scaled,
columns=self.customer_features.columns,
index=self.customer_features.index)
# 1. Determine optimal number of clusters
print("\n1. Determining optimal number of clusters...")
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
inertias = []
silhouette_scores = []
for n_clusters in n_clusters_range:
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(features_scaled)
inertias.append(kmeans.inertia_)
silhouette_avg = silhouette_score(features_scaled, cluster_labels)
silhouette_scores.append(silhouette_avg)
print(f" {n_clusters} clusters: Inertia={kmeans.inertia_:.0f}, "
f"Silhouette={silhouette_avg:.3f}")
# Choose optimal number of clusters (highest silhouette score).
# n_clusters_range starts at 2, so argmax indexes directly into it.
optimal_clusters = list(n_clusters_range)[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters: {optimal_clusters} (highest silhouette score)")
# 2. Final clustering
print(f"\n2. Performing final clustering with {optimal_clusters} clusters...")
self.segmentation_model = KMeans(n_clusters=optimal_clusters, random_state=42, n_init=10)
cluster_labels = self.segmentation_model.fit_predict(features_scaled)
# Add cluster labels to data
self.data['cluster'] = cluster_labels
self.customer_features['cluster'] = cluster_labels
# 3. Profile segments
print(f"\n3. Creating segment profiles...")
self.segment_profiles = self._create_segment_profiles()
# 4. Validation against true segments (if available)
if 'true_segment' in self.data.columns:
self._validate_clustering()
return cluster_labels, self.segment_profiles
def _create_segment_profiles(self):
"""Create detailed profiles for each segment"""
profiles = {}
for cluster_id in sorted(self.data['cluster'].unique()):
cluster_data = self.data[self.data['cluster'] == cluster_id]
profile = {
'size': len(cluster_data),
'percentage': len(cluster_data) / len(self.data) * 100,
'demographics': {
'avg_age': cluster_data['age'].mean(),
'avg_registration_days': cluster_data['registration_days'].mean()
},
'rfm': {
'avg_recency': cluster_data['days_since_last_order'].mean(),
'avg_frequency': cluster_data['total_orders'].mean(),
'avg_monetary': cluster_data['total_spent'].mean()
},
'behavior': {
'avg_order_value': cluster_data['avg_order_value'].mean(),
'category_diversity': cluster_data[['electronics_orders', 'clothing_orders',
'home_orders', 'books_orders', 'sports_orders']].mean(),
'mobile_preference': (cluster_data['mobile_orders'] /
np.maximum(cluster_data['total_orders'], 1)).mean(),
'engagement_score': (cluster_data['email_opens'] / 50).mean()  # email open rate as a simple engagement proxy
}
}
profiles[f'Segment_{cluster_id}'] = profile
# Print profile
print(f"\n Segment {cluster_id} ({profile['size']:,} customers, {profile['percentage']:.1f}%):")
print(f" Demographics: Age={profile['demographics']['avg_age']:.1f}, "
f"Days registered={profile['demographics']['avg_registration_days']:.0f}")
print(f" RFM: Recency={profile['rfm']['avg_recency']:.0f} days, "
f"Frequency={profile['rfm']['avg_frequency']:.1f} orders, "
f"Monetary=${profile['rfm']['avg_monetary']:.0f}")
print(f" Behavior: AOV=${profile['behavior']['avg_order_value']:.0f}, "
f"Mobile pref={profile['behavior']['mobile_preference']:.1%}")
return profiles
def recommend_marketing_strategies(self):
"""Recommend marketing strategies for each segment"""
print("\n=== MARKETING STRATEGY RECOMMENDATIONS ===")
strategies = {}
for segment_name, profile in self.segment_profiles.items():
cluster_id = segment_name.split('_')[1]
# Analyze segment characteristics
high_value = profile['rfm']['avg_monetary'] > self.data['total_spent'].median()
frequent_buyer = profile['rfm']['avg_frequency'] > self.data['total_orders'].median()
recent_activity = profile['rfm']['avg_recency'] < self.data['days_since_last_order'].median()
high_aov = profile['behavior']['avg_order_value'] > self.data['avg_order_value'].median()
# Generate strategy recommendations
recommendations = []
if high_value and frequent_buyer:
recommendations.extend([
"VIP/Premium loyalty program",
"Early access to new products",
"Personalized shopping experiences",
"High-value product recommendations"
])
if not recent_activity:
recommendations.extend([
"Re-engagement campaigns",
"Win-back offers with discounts",
"Reminder emails about abandoned carts",
"Survey to understand satisfaction issues"
])
if frequent_buyer and not high_value:
recommendations.extend([
"Upselling campaigns",
"Bundle offers",
"Category expansion recommendations",
"Volume discount programs"
])
if high_aov and not frequent_buyer:
recommendations.extend([
"Frequency-building campaigns",
"Subscription/repeat purchase incentives",
"Cross-selling based on purchase history",
"Seasonal reminders"
])
if profile['behavior']['mobile_preference'] > 0.7:
recommendations.append("Mobile-optimized campaigns and app notifications")
else:
recommendations.append("Email and desktop-focused campaigns")
strategies[segment_name] = {
'characteristics': {
'high_value': high_value,
'frequent_buyer': frequent_buyer,
'recent_activity': recent_activity,
'high_aov': high_aov
},
'recommendations': recommendations
}
print(f"\n{segment_name} Strategy:")
for rec in recommendations:
print(f" ✅ {rec}")
return strategies
def run_customer_segmentation_project():
"""Run complete customer segmentation project"""
print("CUSTOMER SEGMENTATION PROJECT")
print("=" * 50)
project = CustomerSegmentationProject()
# Generate data
data = project.generate_ecommerce_data(n_customers=5000)
# Feature engineering
features = project.feature_engineering_unsupervised()
# Perform segmentation
clusters, profiles = project.perform_customer_segmentation()
# Generate marketing recommendations
strategies = project.recommend_marketing_strategies()
return project, profiles, strategies

Scenario: A financial services company needs to detect fraudulent transactions in real time to minimize financial losses while maintaining customer satisfaction.
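Before touching any model, it helps to see why plain accuracy is a trap here. At a 2% fraud rate, a degenerate "model" that never flags anything scores about 98% accuracy while catching zero fraud. A minimal illustration with synthetic labels (not the project's dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.02, 10_000)  # ~2% positives, as in this scenario
y_pred = np.zeros_like(y_true)          # predict "normal" for every transaction

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")  # ~0.98
print(f"Recall:   {recall_score(y_true, y_pred):.3f}")    # 0.0 - zero fraud caught
```

This is why the project below leans on recall, PR-AUC, and explicit business cost rather than accuracy.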
class FraudDetectionProject:
def __init__(self):
self.data = None
self.X_train = None
self.X_test = None
self.y_train = None
self.y_test = None
self.models = {}
self.evaluation_results = {}
def generate_fraud_dataset(self, n_transactions=100000, fraud_rate=0.02):
"""Generate realistic fraud detection dataset"""
np.random.seed(42)
# Transaction features
data = {
'transaction_id': range(1, n_transactions + 1),
'amount': np.random.lognormal(3, 1.5, n_transactions), # Log-normal distribution for amounts
'hour': np.random.randint(0, 24, n_transactions),
'day_of_week': np.random.randint(0, 7, n_transactions),
'merchant_category': np.random.choice(['grocery', 'gas', 'restaurant', 'retail', 'online', 'atm'],
n_transactions, p=[0.25, 0.15, 0.20, 0.15, 0.20, 0.05]),
'transaction_type': np.random.choice(['purchase', 'withdrawal', 'transfer'],
n_transactions, p=[0.7, 0.2, 0.1])
}
# Customer behavior features
data['customer_age'] = np.random.normal(45, 15, n_transactions).astype(int)
data['customer_age'] = np.clip(data['customer_age'], 18, 90)
data['account_age_days'] = np.random.exponential(1000, n_transactions).astype(int)
data['transactions_last_30_days'] = np.random.poisson(20, n_transactions)
data['avg_transaction_amount'] = np.random.lognormal(2.5, 1, n_transactions)
# Location and device features
data['same_city_last_transaction'] = np.random.choice([0, 1], n_transactions, p=[0.1, 0.9])
data['same_device'] = np.random.choice([0, 1], n_transactions, p=[0.05, 0.95])
data['international_transaction'] = np.random.choice([0, 1], n_transactions, p=[0.95, 0.05])
# Time-based features
data['weekend'] = (data['day_of_week'] >= 5).astype(int)
data['night_time'] = ((data['hour'] >= 22) | (data['hour'] <= 6)).astype(int)
df = pd.DataFrame(data)
# Create realistic fraud patterns
fraud_probability = np.full(n_transactions, 0.001) # Base fraud rate
# High-risk patterns increase fraud probability
fraud_probability += np.where(df['amount'] > df['amount'].quantile(0.95), 0.15, 0) # Very high amounts
fraud_probability += np.where(df['night_time'] == 1, 0.02, 0) # Night transactions
fraud_probability += np.where(df['international_transaction'] == 1, 0.08, 0) # International
fraud_probability += np.where(df['same_device'] == 0, 0.05, 0) # Different device
fraud_probability += np.where(df['same_city_last_transaction'] == 0, 0.03, 0) # Different location
fraud_probability += np.where(df['merchant_category'] == 'atm', 0.03, 0) # ATM transactions
fraud_probability += np.where(df['account_age_days'] < 30, 0.04, 0) # New accounts
# Amount vs. customer history
amount_ratio = df['amount'] / df['avg_transaction_amount']
fraud_probability += np.where(amount_ratio > 5, 0.1, 0) # Much larger than usual
# Multiple transactions in short time (velocity)
fraud_probability += np.where(df['transactions_last_30_days'] > 50, 0.03, 0)
# Clip probabilities
fraud_probability = np.clip(fraud_probability, 0, 0.5)
# Generate fraud labels
df['is_fraud'] = np.random.binomial(1, fraud_probability)
# Adjust to target fraud rate
actual_fraud_rate = df['is_fraud'].mean()
if actual_fraud_rate > fraud_rate:
# Randomly convert some frauds to normal
fraud_indices = df[df['is_fraud'] == 1].index
n_to_convert = int((actual_fraud_rate - fraud_rate) * len(df))
convert_indices = np.random.choice(fraud_indices, size=min(n_to_convert, len(fraud_indices)), replace=False)
df.loc[convert_indices, 'is_fraud'] = 0
self.data = df
print(f"Dataset created:")
print(f" Total transactions: {len(df):,}")
print(f" Fraud transactions: {df['is_fraud'].sum():,} ({df['is_fraud'].mean():.2%})")
print(f" Normal transactions: {(df['is_fraud'] == 0).sum():,}")
return df
def analyze_fraud_patterns(self):
"""Analyze fraud patterns in the data"""
print("=== FRAUD PATTERN ANALYSIS ===")
fraud_data = self.data[self.data['is_fraud'] == 1]
normal_data = self.data[self.data['is_fraud'] == 0]
print(f"\n1. Transaction Amount Analysis:")
print(f" Fraud transactions - Mean: ${fraud_data['amount'].mean():.2f}, Median: ${fraud_data['amount'].median():.2f}")
print(f" Normal transactions - Mean: ${normal_data['amount'].mean():.2f}, Median: ${normal_data['amount'].median():.2f}")
print(f"\n2. Timing Patterns:")
fraud_night_pct = (fraud_data['night_time'] == 1).mean() * 100
normal_night_pct = (normal_data['night_time'] == 1).mean() * 100
print(f" Night-time transactions - Fraud: {fraud_night_pct:.1f}%, Normal: {normal_night_pct:.1f}%")
fraud_weekend_pct = (fraud_data['weekend'] == 1).mean() * 100
normal_weekend_pct = (normal_data['weekend'] == 1).mean() * 100
print(f" Weekend transactions - Fraud: {fraud_weekend_pct:.1f}%, Normal: {normal_weekend_pct:.1f}%")
print(f"\n3. Location and Device Patterns:")
fraud_intl_pct = (fraud_data['international_transaction'] == 1).mean() * 100
normal_intl_pct = (normal_data['international_transaction'] == 1).mean() * 100
print(f" International - Fraud: {fraud_intl_pct:.1f}%, Normal: {normal_intl_pct:.1f}%")
fraud_diff_device_pct = (fraud_data['same_device'] == 0).mean() * 100
normal_diff_device_pct = (normal_data['same_device'] == 0).mean() * 100
print(f" Different device - Fraud: {fraud_diff_device_pct:.1f}%, Normal: {normal_diff_device_pct:.1f}%")
return self._create_fraud_analysis_plots()
def _create_fraud_analysis_plots(self):
"""Create visualizations for fraud analysis"""
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Fraud distribution by amount
sns.histplot(self.data[self.data['is_fraud'] == 1]['amount'], bins=50, kde=True, ax=axes[0,0])
axes[0,0].set_title('Fraudulent Transactions Amount Distribution')
axes[0,0].set_xlabel('Amount')
axes[0,0].set_ylabel('Frequency')
# 2. Transaction time analysis
fraud_times = self.data[self.data['is_fraud'] == 1]['hour']
sns.histplot(fraud_times, bins=24, kde=True, ax=axes[0,1])
axes[0,1].set_title('Fraudulent Transactions by Hour of Day')
axes[0,1].set_xlabel('Hour of Day')
axes[0,1].set_ylabel('Frequency')
# 3. Day of week analysis
fraud_days = self.data[self.data['is_fraud'] == 1]['day_of_week']
sns.histplot(fraud_days, bins=7, kde=True, ax=axes[1,0])
axes[1,0].set_title('Fraudulent Transactions by Day of Week')
axes[1,0].set_xlabel('Day of Week')
axes[1,0].set_ylabel('Frequency')
# 4. Correlation heatmap for fraud data
sns.heatmap(self.data.corr(numeric_only=True), annot=True, cmap='coolwarm', ax=axes[1,1])
axes[1,1].set_title('Feature Correlation Matrix (Fraud Data)')
plt.tight_layout()
return fig
def handle_class_imbalance(self):
"""Implement techniques to handle class imbalance"""
print("=== HANDLING CLASS IMBALANCE ===")
# Prepare features
feature_columns = [col for col in self.data.columns
if col not in ['transaction_id', 'is_fraud']]
# Encode categorical variables
df_encoded = pd.get_dummies(self.data[feature_columns + ['is_fraud']],
columns=['merchant_category', 'transaction_type'])
X = df_encoded.drop('is_fraud', axis=1)
y = df_encoded['is_fraud']
# Train-test split
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Original class distribution:")
print(f" Training: {self.y_train.value_counts().to_dict()}")
print(f" Testing: {self.y_test.value_counts().to_dict()}")
# Scale features
scaler = StandardScaler()
self.X_train_scaled = scaler.fit_transform(self.X_train)
self.X_test_scaled = scaler.transform(self.X_test)
# Implement different sampling strategies
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek
sampling_strategies = {
'original': (self.X_train_scaled, self.y_train),
'smote': SMOTE(random_state=42),
'adasyn': ADASYN(random_state=42),
'undersampling': RandomUnderSampler(random_state=42),
'smote_tomek': SMOTETomek(random_state=42)
}
self.resampled_datasets = {}
for strategy_name, strategy in sampling_strategies.items():
if strategy_name == 'original':
self.resampled_datasets[strategy_name] = strategy
else:
X_resampled, y_resampled = strategy.fit_resample(self.X_train_scaled, self.y_train)
self.resampled_datasets[strategy_name] = (X_resampled, y_resampled)
print(f"\n{strategy_name.upper()} resampling:")
print(f" New shape: {X_resampled.shape}")
print(f" Class distribution: {pd.Series(y_resampled).value_counts().to_dict()}")
return self.resampled_datasets
def train_fraud_models(self):
"""Train models with different sampling strategies and algorithms"""
print("=== TRAINING FRAUD DETECTION MODELS ===")
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
import xgboost as xgb
# Models to test
models = {
'logistic_regression': LogisticRegression(random_state=42, class_weight='balanced'),
'random_forest': RandomForestClassifier(random_state=42, class_weight='balanced'),
'xgboost': xgb.XGBClassifier(random_state=42, scale_pos_weight=10), # Handle imbalance
'gradient_boosting': GradientBoostingClassifier(random_state=42)
}
# Evaluation metrics for imbalanced datasets
def evaluate_fraud_model(model, X_test, y_test, model_name, sampling_strategy):
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, precision_recall_fscore_support,
average_precision_score)
# Standard metrics
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
roc_auc = roc_auc_score(y_test, y_pred_proba)
pr_auc = average_precision_score(y_test, y_pred_proba)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
# Business metrics
cost_per_fraud_missed = 1000 # Average loss per fraud
cost_per_false_alarm = 10 # Cost to investigate false positive
total_cost = (fn * cost_per_fraud_missed) + (fp * cost_per_false_alarm)
return {
'model': model_name,
'sampling_strategy': sampling_strategy,
'precision': precision,
'recall': recall,
'f1_score': f1,
'roc_auc': roc_auc,
'pr_auc': pr_auc,
'true_positives': tp,
'false_positives': fp,
'true_negatives': tn,
'false_negatives': fn,
'total_cost': total_cost
}
# Train and evaluate all combinations
results = []
for sampling_name, (X_train_resampled, y_train_resampled) in self.resampled_datasets.items():
print(f"\nTesting sampling strategy: {sampling_name}")
for model_name, model in models.items():
try:
# Train model
model.fit(X_train_resampled, y_train_resampled)
# Evaluate
result = evaluate_fraud_model(model, self.X_test_scaled, self.y_test,
model_name, sampling_name)
results.append(result)
print(f" {model_name}: Precision={result['precision']:.3f}, "
f"Recall={result['recall']:.3f}, PR-AUC={result['pr_auc']:.3f}")
except Exception as e:
print(f" {model_name}: Error - {str(e)}")
# Convert to DataFrame for analysis
self.evaluation_results = pd.DataFrame(results)
# Find best model based on PR-AUC (better for imbalanced datasets)
best_model_idx = self.evaluation_results['pr_auc'].idxmax()
best_result = self.evaluation_results.iloc[best_model_idx]
print(f"\n=== BEST MODEL PERFORMANCE ===")
print(f"Model: {best_result['model']} with {best_result['sampling_strategy']}")
print(f"Precision: {best_result['precision']:.3f}")
print(f"Recall: {best_result['recall']:.3f}")
print(f"F1-Score: {best_result['f1_score']:.3f}")
print(f"ROC-AUC: {best_result['roc_auc']:.3f}")
print(f"PR-AUC: {best_result['pr_auc']:.3f}")
print(f"Business cost: ${best_result['total_cost']:,.0f}")
return self.evaluation_results
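The business-cost numbers above suggest one more lever that is easy to miss: the decision threshold. Instead of the default 0.5 cutoff, we can sweep thresholds over the predicted fraud probabilities and keep the one that minimizes total cost, using the same cost assumptions as `evaluate_fraud_model`. A minimal sketch with a hypothetical helper (the function name and the commented usage below are illustrative, not part of the class):

```python
import numpy as np

def find_cost_minimizing_threshold(y_true, y_proba, cost_fn=1000, cost_fp=10):
    """Return (threshold, cost) with the lowest expected business cost."""
    best_threshold, best_cost = 0.5, float('inf')
    for threshold in np.arange(0.05, 0.96, 0.01):
        y_pred = (y_proba >= threshold).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))  # frauds we miss
        fp = np.sum((y_true == 0) & (y_pred == 1))  # alarms we investigate
        cost = fn * cost_fn + fp * cost_fp
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Usage with a fitted model (names assumed from the class above):
# proba = model.predict_proba(X_test_scaled)[:, 1]
# threshold, cost = find_cost_minimizing_threshold(y_test.values, proba)
```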
def run_fraud_detection_project():
"""Run complete fraud detection project"""
print("FRAUD DETECTION PROJECT")
print("=" * 50)
project = FraudDetectionProject()
# Generate data
data = project.generate_fraud_dataset(n_transactions=50000, fraud_rate=0.02)
# Analyze patterns
analysis_fig = project.analyze_fraud_patterns()
# Handle imbalance
resampled_datasets = project.handle_class_imbalance()
# Train models
results = project.train_fraud_models()
return project, results

Objective: Build a complete ML pipeline from data ingestion to model deployment.
# Lab 1: Complete ML Pipeline
def lab_ml_pipeline():
"""
Lab Exercise: Build a complete ML pipeline for predicting employee attrition
Tasks:
1. Data loading and initial exploration
2. Data preprocessing and feature engineering
3. Model selection and hyperparameter tuning
4. Model evaluation and interpretation
5. Deployment preparation
"""
print("LAB 1: COMPLETE ML PIPELINE")
print("=" * 40)
# Step 1: Generate employee attrition data
np.random.seed(42)
n_employees = 2000
employee_data = {
'employee_id': range(1, n_employees + 1),
'age': np.random.normal(35, 8, n_employees).astype(int),
'years_at_company': np.random.exponential(5, n_employees).astype(int),
'salary': np.random.normal(70000, 20000, n_employees),
'satisfaction_score': np.random.uniform(1, 10, n_employees),
'performance_rating': np.random.choice([1, 2, 3, 4, 5], n_employees, p=[0.05, 0.15, 0.6, 0.15, 0.05]),
'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'],
n_employees, p=[0.4, 0.2, 0.15, 0.15, 0.1]),
'remote_work': np.random.choice([0, 1], n_employees, p=[0.7, 0.3]),
'overtime_hours': np.random.exponential(5, n_employees),
'commute_distance': np.random.exponential(10, n_employees)
}
df = pd.DataFrame(employee_data)
# Create realistic attrition patterns
attrition_prob = 0.1 # Base probability
# Factors increasing attrition
prob_adjustments = np.zeros(n_employees)
prob_adjustments += np.where(df['satisfaction_score'] < 5, 0.2, 0)
prob_adjustments += np.where(df['years_at_company'] < 2, 0.15, 0)
prob_adjustments += np.where(df['salary'] < 50000, 0.1, 0)
prob_adjustments += np.where(df['overtime_hours'] > 10, 0.12, 0)
prob_adjustments += np.where(df['performance_rating'] <= 2, 0.15, 0)
prob_adjustments += np.where(df['commute_distance'] > 20, 0.08, 0)
final_prob = np.clip(attrition_prob + prob_adjustments, 0, 0.8)
df['attrition'] = np.random.binomial(1, final_prob)
# Task instructions for students
tasks = """
TODO: Complete the following tasks:
1. DATA EXPLORATION
- Analyze target variable distribution
- Identify numerical vs categorical features
- Check for missing values and outliers
- Calculate correlation with target
2. FEATURE ENGINEERING
- Create new features (e.g., tenure categories, salary bands)
- Handle categorical variables
- Scale numerical features
3. MODEL BUILDING
- Split data into train/validation/test
- Try multiple algorithms
- Perform hyperparameter tuning
4. EVALUATION
- Calculate comprehensive metrics
- Create confusion matrix and ROC curve
- Analyze feature importance
5. DEPLOYMENT PREP
- Create prediction pipeline
- Validate on holdout test set
- Document model performance
"""
print(tasks)
# Provide starter code structure
starter_code = """
# Starter code structure:
# 1. Load and explore data
print("Dataset shape:", df.shape)
print("Attrition rate:", df['attrition'].mean())
# 2. EDA - Add your analysis here
# TODO: Implement exploratory data analysis
# 3. Preprocessing - Add your preprocessing here
# TODO: Feature engineering and preprocessing
# 4. Model training - Add your models here
# TODO: Train and evaluate multiple models
# 5. Final evaluation - Add evaluation code here
# TODO: Comprehensive model evaluation
"""
print(starter_code)
return df

def lab_model_interpretability():
"""
Lab Exercise: Implement model interpretability techniques
Covers:
- SHAP values
- LIME explanations
- Feature importance analysis
- Partial dependence plots
"""
print("LAB 2: MODEL INTERPRETABILITY")
print("=" * 40)
tasks = """
INTERPRETABILITY LAB TASKS:
1. GLOBAL INTERPRETABILITY
- Calculate and plot feature importance
- Create partial dependence plots
- Analyze feature interactions
2. LOCAL INTERPRETABILITY
- Implement SHAP explanations
- Use LIME for individual predictions
- Create explanation dashboards
3. MODEL COMPARISON
- Compare interpretability across model types
- Analyze trade-offs between accuracy and interpretability
4. BUSINESS INSIGHTS
- Translate technical insights to business language
- Identify actionable insights
- Create executive summary
"""
print(tasks)
# Sample interpretability code
sample_code = """
# Sample interpretability implementation:
import shap
from lime import lime_tabular
import matplotlib.pyplot as plt
# SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
# LIME explanations
lime_explainer = lime_tabular.LimeTabularExplainer(
X_train, feature_names=feature_names, mode='classification'
)
explanation = lime_explainer.explain_instance(
X_test.iloc[0], model.predict_proba
)
# Partial dependence plots
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X_train, features=[0, 1, (0, 1)])
"""
print(sample_code)
def lab_mlops_pipeline():
"""
Lab Exercise: Implement MLOps pipeline with monitoring and deployment
Covers:
- Model versioning
- Automated retraining
- Performance monitoring
- A/B testing setup
"""
print("LAB 3: MLOPS PIPELINE")
print("=" * 40)
tasks = """
MLOPS PIPELINE TASKS:
1. VERSION CONTROL
- Set up model versioning system
- Track experiments and parameters
- Implement model registry
2. AUTOMATED PIPELINE
- Create training pipeline
- Implement validation checks
- Set up automated deployment
3. MONITORING SETUP
- Implement data drift detection
- Set up performance monitoring
- Create alerting system
4. A/B TESTING
- Design A/B testing framework
- Implement traffic splitting
- Set up metrics collection
"""
print(tasks)
pipeline_template = """
# MLOps Pipeline Template:
class MLOpsPipeline:
def __init__(self, model_name, version):
self.model_name = model_name
self.version = version
self.model_registry = {}
def train_pipeline(self, data_path):
# Load data
# Preprocess
# Train model
# Validate performance
# Register model if valid
pass
def deploy_model(self, model_version):
# Load model from registry
# Create deployment package
# Deploy to production
# Update monitoring
pass
def monitor_performance(self):
# Check data drift
# Monitor accuracy
# Check for anomalies
# Send alerts if needed
pass
"""
print(pipeline_template)
## 8.7 Best Practices and Common Pitfalls
### 8.7.1 Data Science Best Practices
**1. Data Quality Assurance**
- Always validate data quality before modeling
- Document data sources and collection methods
- Implement data validation checks in production
- Monitor for data drift and quality degradation
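The first bullet above can be made concrete with a few assertion-style checks. A minimal sketch, assuming a pandas DataFrame with the housing columns used earlier in this chapter (the thresholds are illustrative):

```python
import pandas as pd

def validate_housing_batch(df):
    """Return a list of data-quality issues; an empty list means checks pass."""
    issues = []
    if df['price'].isnull().any():
        issues.append("price contains missing values")
    if (df['price'] <= 0).any():
        issues.append("price contains non-positive values")
    if (df['square_feet'] < 100).any():
        issues.append("square_feet below plausible minimum")
    if not df['bedrooms'].between(0, 20).all():
        issues.append("bedrooms outside plausible range")
    return issues

# Run the same checks on every incoming batch in production so schema or
# distribution changes surface as alerts instead of silent model decay.
```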
**2. Reproducibility**
- Set random seeds for all stochastic processes
- Version control all code and configuration
- Document environment dependencies
- Use containerization for deployment consistency
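For the seeding point, fixing every source of randomness in one place is the cheapest insurance. A minimal sketch (extend with framework-specific seeds, e.g. for PyTorch or TensorFlow, if your project uses them):

```python
import os
import random

import numpy as np

def set_global_seed(seed=42):
    """Seed the random number generators this chapter's examples rely on."""
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy's global RNG
    os.environ['PYTHONHASHSEED'] = str(seed)   # propagated to subprocesses

set_global_seed(42)  # call once, before any splitting or training
```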
**3. Model Validation**
- Use appropriate cross-validation strategies
- Hold out a final test set for unbiased evaluation
- Validate on out-of-time data when applicable
- Test model robustness with adversarial examples
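The held-out final test set deserves emphasis: it must be split off first and touched exactly once. A minimal sketch of the split discipline for a classification problem, assuming features `X` and labels `y` (the 60/20/20 ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

# Carve off the final test set FIRST - it is never used for any decision.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Split the remainder for model development and selection.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42, stratify=y_dev  # 0.25 * 0.8 = 20%
)
# X_test / y_test are evaluated exactly once, after the final model is frozen.
```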
### 8.7.2 Common Pitfalls and How to Avoid Them
```python
class MLProjectPitfalls:
"""Common pitfalls in ML projects and how to avoid them"""
@staticmethod
def data_leakage_examples():
"""Examples of data leakage and prevention"""
pitfalls = {
"Future Information Leakage": {
"description": "Using information that wouldn't be available at prediction time",
"example": "Including 'days_since_last_transaction' in a fraud detection model where you're trying to predict fraud in real-time",
"solution": "Carefully review features for temporal consistency"
},
"Target Leakage": {
"description": "Including features that are direct derivatives of the target",
"example": "Using 'approved_loan_amount' to predict loan approval",
"solution": "Remove features that are consequences of the target variable"
},
"Train-Test Contamination": {
"description": "Information from test set influencing training",
"example": "Scaling features using statistics from entire dataset before splitting",
"solution": "Always split data before any preprocessing that involves statistics"
}
}
return pitfalls
@staticmethod
def sampling_bias_prevention():
"""Prevent sampling and selection bias"""
prevention_strategies = {
"Temporal Splits": "Use time-based splits for time series data",
"Stratified Sampling": "Maintain class distributions across splits",
"Representative Sampling": "Ensure test data represents production distribution",
"Cross-Validation": "Use appropriate CV strategy for your data type"
}
return prevention_strategies
@staticmethod
def overfitting_prevention():
"""Comprehensive overfitting prevention strategies"""
strategies = {
"Regularization": "Use L1/L2 regularization or dropout",
"Early Stopping": "Stop training when validation performance degrades",
"Feature Selection": "Remove irrelevant/redundant features",
"Cross-Validation": "Use proper validation to assess generalization",
"Ensemble Methods": "Combine multiple models to reduce variance",
"Data Augmentation": "Increase training data diversity when possible"
}
return strategies
# Checklist for ML Project Success
ml_project_checklist = {
"Business Understanding": [
"✅ Clear problem definition",
"✅ Success metrics defined",
"✅ Stakeholder alignment",
"✅ Resource constraints identified"
],
"Data Preparation": [
"✅ Data quality assessed",
"✅ Missing value strategy defined",
"✅ Feature engineering completed",
"✅ Data leakage prevented"
],
"Modeling": [
"✅ Baseline model established",
"✅ Multiple algorithms tested",
"✅ Hyperparameters optimized",
"✅ Cross-validation performed"
],
"Evaluation": [
"✅ Appropriate metrics selected",
"✅ Business impact calculated",
"✅ Model interpretability assessed",
"✅ Bias and fairness evaluated"
],
"Deployment": [
"✅ Production pipeline designed",
"✅ Monitoring strategy implemented",
"✅ Rollback plan prepared",
"✅ Documentation completed"
]
}
```
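The train-test contamination pitfall above is worth one concrete pattern: wrap preprocessing and model in a single scikit-learn Pipeline so that scaling statistics are learned only inside each training fold. A minimal sketch on illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)

# Wrong: fitting the scaler on ALL rows lets test folds leak into the
# training statistics.
# X_scaled = StandardScaler().fit_transform(X)
# scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Right: the pipeline refits the scaler inside each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```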
This chapter provided comprehensive coverage of end-to-end machine learning projects through the CRISP-DM methodology and four detailed case studies:
CRISP-DM Methodology: Structured approach to ML projects with six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
Case Studies Completed: including house price prediction (regression), customer segmentation (clustering), and fraud detection (imbalanced classification).
Technical Implementation: Complete code examples for data preprocessing, feature engineering, model selection, hyperparameter tuning, and evaluation.
Business Integration: Frameworks for translating technical results into business value and actionable insights.
MLOps Considerations: Model deployment, monitoring, and maintenance strategies for production systems.
The next chapter will focus on Model Selection and Evaluation techniques, diving deeper into advanced evaluation methodologies, cross-validation strategies, and model comparison frameworks that build upon the foundation established in this chapter.
Apply the CRISP-DM methodology to a new domain (e.g., healthcare, retail, manufacturing). Document each phase and identify domain-specific challenges.
Given a raw dataset, implement comprehensive feature engineering including: - Temporal features - Interaction terms - Domain-specific transformations - Dimensionality reduction
Take one of the case study models and implement multiple interpretability techniques: - SHAP values - LIME explanations - Permutation importance - Partial dependence plots
Design and implement a complete deployment pipeline including: - Model packaging - API creation - Monitoring setup - A/B testing framework
For each case study, perform detailed business impact analysis including: - ROI calculations - Cost-benefit analysis - Risk assessment - Implementation timeline

# Chapter 9: The Judge and Jury - Model Selection and Evaluation
By the end of this chapter, you will have transcended from model builder to algorithmic arbiter: - Orchestrate sophisticated cross-validation symphonies that reveal truth beyond randomness - Architect evaluation frameworks that separate genuine intelligence from statistical accidents - Wield statistical significance testing as your sword of scientific truth - Navigate the treacherous waters of imbalanced data, temporal dependencies, and domain constraints - Build automated model selection systems that think and adapt like experienced data scientists - Master the art of nested cross-validation, the Zen of unbiased performance estimation - Craft custom metrics that speak the language of business value and human impact
“All models are wrong, but some are useful. The art is knowing which ones.” - Adapted from George Box
Welcome to the most crucial chapter in your machine learning journey, where we transform from optimistic model builders into rigorous evaluators of algorithmic truth. This is where the rubber meets the road, where dreams of perfect predictions encounter the harsh but beautiful reality of statistical validation.
Imagine you’re a judge in a court where the defendants are algorithms and the evidence is data. Your verdict doesn’t just affect academic scores; it influences real decisions that impact real people. Will this medical diagnostic model save lives or give false hope? Will this loan approval algorithm promote fairness or perpetuate bias? Will this recommendation system delight users or trap them in filter bubbles?
This chapter is your judicial training academy for the algorithmic age. We don’t just compare numbers; we develop the wisdom to distinguish between models that merely memorize and those that truly understand.
What awaits you in this transformative chapter:
🎯 Cross-Validation Mastery: Beyond simple train-test splits to sophisticated validation strategies that honor the complexity of real-world data
📊 Statistical Significance: Learning to hear the whispers of true signal above the shouts of random noise
⚖️ Fair Comparison Frameworks: Building evaluation systems that give every algorithm a fair trial
🔮 Future-Proof Validation: Techniques that predict not just performance, but sustainability and reliability
🎭 Domain-Aware Evaluation: Adapting your judgment to the unique requirements of different industries and applications
This isn’t just about picking the highest accuracy score; it’s about developing the intuitive wisdom that recognizes when a model is ready for the real world. You’ll learn to see beyond the surface metrics to understand the deeper questions: Does this model capture the essence of the problem? Will it degrade gracefully when conditions change? Can we trust its confidence estimates?
import numpy as np
import pandas as pd
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
GroupKFold, LeaveOneGroupOut, cross_val_score)
from sklearn.metrics import make_scorer
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
class AdvancedCrossValidation:
def __init__(self, random_state=42):
self.random_state = random_state
self.cv_strategies = {}
self.results = {}
def stratified_group_kfold(self, X, y, groups, n_splits=5):
"""
Implement stratified group K-fold that maintains both group integrity
and class distribution balance
"""
from collections import defaultdict, Counter
# Group samples by group and class
group_class_counts = defaultdict(lambda: defaultdict(int))
for group, label in zip(groups, y):
group_class_counts[group][label] += 1
# Convert to list of (group, class_distribution)
groups_info = []
for group, class_counts in group_class_counts.items():
total_samples = sum(class_counts.values())
class_ratios = {cls: count/total_samples for cls, count in class_counts.items()}
groups_info.append((group, class_ratios, total_samples))
# Sort groups by size for better distribution
groups_info.sort(key=lambda x: x[2], reverse=True)
# Initialize folds
folds = [[] for _ in range(n_splits)]
fold_class_counts = [defaultdict(int) for _ in range(n_splits)]
# Assign groups to folds
for group, class_ratios, group_size in groups_info:
# Find fold with most similar class distribution
best_fold = 0
best_score = float('inf')
for fold_idx in range(n_splits):
# Calculate distribution similarity
fold_total = sum(fold_class_counts[fold_idx].values())
if fold_total == 0:
score = 0 # Empty fold, good choice
else:
score = 0
for cls in set(class_ratios.keys()) | set(fold_class_counts[fold_idx].keys()):
current_ratio = fold_class_counts[fold_idx][cls] / fold_total
target_ratio = class_ratios.get(cls, 0)
score += abs(current_ratio - target_ratio)
if score < best_score:
best_score = score
best_fold = fold_idx
# Assign group to best fold
folds[best_fold].append(group)
for cls, count in group_class_counts[group].items():
fold_class_counts[best_fold][cls] += count
# Generate train/test indices
for test_fold in range(n_splits):
test_groups = set(folds[test_fold])
train_indices = []
test_indices = []
for idx, group in enumerate(groups):
if group in test_groups:
test_indices.append(idx)
else:
train_indices.append(idx)
yield train_indices, test_indices
def temporal_cross_validation(self, X, y, time_column, n_splits=5, gap_size=0):
"""
Implement time-aware cross-validation with optional gap between train/test
"""
# Sort data by time
time_sorted_idx = np.argsort(X[time_column])
n_samples = len(X)
# Calculate fold sizes
test_size = n_samples // n_splits
for i in range(n_splits):
# Calculate test period
test_start = i * test_size
test_end = min((i + 1) * test_size, n_samples)
# Apply gap
train_end = max(0, test_start - gap_size)
# Get indices
train_indices = time_sorted_idx[:train_end].tolist()
test_indices = time_sorted_idx[test_start:test_end].tolist()
if len(train_indices) > 0 and len(test_indices) > 0:
yield train_indices, test_indices
def nested_cross_validation(self, model, param_grid, X, y, outer_cv=5, inner_cv=3,
scoring='accuracy'):
"""
Implement nested cross-validation for unbiased performance estimation
"""
from sklearn.model_selection import GridSearchCV, cross_val_score
# Outer CV for performance estimation
outer_scores = []
best_params_per_fold = []
outer_cv_splitter = KFold(n_splits=outer_cv, shuffle=True, random_state=self.random_state)
for fold_idx, (train_idx, test_idx) in enumerate(outer_cv_splitter.split(X)):
print(f"Processing outer fold {fold_idx + 1}/{outer_cv}")
X_train_outer, X_test_outer = X.iloc[train_idx], X.iloc[test_idx]
y_train_outer, y_test_outer = y.iloc[train_idx], y.iloc[test_idx]
# Inner CV for hyperparameter selection
inner_cv_splitter = KFold(n_splits=inner_cv, shuffle=True, random_state=self.random_state)
grid_search = GridSearchCV(
model, param_grid, cv=inner_cv_splitter,
scoring=scoring, n_jobs=-1
)
# Fit on outer training set
grid_search.fit(X_train_outer, y_train_outer)
# Evaluate best model on outer test set
best_model = grid_search.best_estimator_
score = best_model.score(X_test_outer, y_test_outer)
outer_scores.append(score)
best_params_per_fold.append(grid_search.best_params_)
print(f" Fold {fold_idx + 1} score: {score:.4f}")
print(f" Best params: {grid_search.best_params_}")
results = {
'outer_scores': outer_scores,
'mean_score': np.mean(outer_scores),
'std_score': np.std(outer_scores),
'best_params_per_fold': best_params_per_fold,
'cv_scores_detailed': outer_scores
}
print(f"\nNested CV Results:")
print(f"Mean score: {results['mean_score']:.4f} (+/- {results['std_score'] * 2:.4f})")
return results
def custom_cv_for_time_series(self, X, y, time_column, forecast_horizon=1,
min_train_size=None, step_size=1):
"""
Time series cross-validation with walk-forward validation
"""
# Sort by time
time_sorted = X.sort_values(time_column)
sorted_indices = time_sorted.index.tolist()
n_samples = len(X)
if min_train_size is None:
min_train_size = n_samples // 3
folds = []
for start_idx in range(min_train_size, n_samples - forecast_horizon, step_size):
train_indices = sorted_indices[:start_idx]
test_indices = sorted_indices[start_idx:start_idx + forecast_horizon]
if len(test_indices) == forecast_horizon:
folds.append((train_indices, test_indices))
print(f"Generated {len(folds)} time series CV folds")
return folds
def evaluate_cv_stability(self, model, X, y, cv_strategies, scoring='accuracy', n_repeats=5):
"""
Evaluate stability of cross-validation results across different strategies
"""
results = {}
for strategy_name, cv_splitter in cv_strategies.items():
strategy_scores = []
for repeat in range(n_repeats):
# Add randomness for repeated evaluation
if hasattr(cv_splitter, 'random_state'):
cv_splitter.random_state = self.random_state + repeat
scores = cross_val_score(model, X, y, cv=cv_splitter, scoring=scoring)
strategy_scores.extend(scores)
results[strategy_name] = {
'scores': strategy_scores,
'mean': np.mean(strategy_scores),
'std': np.std(strategy_scores),
'min': np.min(strategy_scores),
'max': np.max(strategy_scores),
'cv': np.std(strategy_scores) / np.mean(strategy_scores) # Coefficient of variation
}
return results
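A quick usage sketch for the nested cross-validation method above. The model, parameter grid, and synthetic data are assumptions for the example, not fixtures from this chapter:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X, y = pd.DataFrame(X), pd.Series(y)  # nested_cross_validation indexes with .iloc

acv = AdvancedCrossValidation(random_state=42)
results = acv.nested_cross_validation(
    model=RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [100, 200], 'max_depth': [None, 5]},
    X=X, y=y, outer_cv=5, inner_cv=3, scoring='accuracy'
)
# results['mean_score'] is an unbiased estimate because hyperparameters were
# tuned only inside each outer training fold, never on the outer test fold.
```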
class ModelComparisonFramework:
def __init__(self, random_state=42):
self.random_state = random_state
self.comparison_results = {}
def statistical_comparison(self, model_results, alpha=0.05):
"""
Perform statistical significance testing between models
"""
from scipy.stats import ttest_rel, wilcoxon, friedmanchisquare
import itertools
model_names = list(model_results.keys())
comparison_matrix = pd.DataFrame(index=model_names, columns=model_names)
# Pairwise comparisons
for model1, model2 in itertools.combinations(model_names, 2):
scores1 = model_results[model1]['scores']
scores2 = model_results[model2]['scores']
# Paired t-test (assumes normality)
t_stat, t_pval = ttest_rel(scores1, scores2)
# Wilcoxon signed-rank test (non-parametric)
w_stat, w_pval = wilcoxon(scores1, scores2)
comparison_matrix.loc[model1, model2] = f't:{t_pval:.4f}, w:{w_pval:.4f}'
comparison_matrix.loc[model2, model1] = f't:{t_pval:.4f}, w:{w_pval:.4f}'
# Fill diagonal
for model in model_names:
comparison_matrix.loc[model, model] = '1.0000'
# Overall comparison (Friedman test for multiple models)
if len(model_names) > 2:
all_scores = [model_results[model]['scores'] for model in model_names]
friedman_stat, friedman_pval = friedmanchisquare(*all_scores)
print(f"Friedman test statistic: {friedman_stat:.4f}")
print(f"Friedman test p-value: {friedman_pval:.4f}")
if friedman_pval < alpha:
print("Significant difference detected between models (Friedman test)")
else:
print("No significant difference between models (Friedman test)")
return comparison_matrix
def effect_size_analysis(self, model_results):
"""
Calculate effect sizes (Cohen's d) for model comparisons
"""
import itertools
model_names = list(model_results.keys())
effect_sizes = {}
for model1, model2 in itertools.combinations(model_names, 2):
scores1 = np.array(model_results[model1]['scores'])
scores2 = np.array(model_results[model2]['scores'])
# Cohen's d
pooled_std = np.sqrt(((len(scores1) - 1) * np.var(scores1, ddof=1) +
(len(scores2) - 1) * np.var(scores2, ddof=1)) /
(len(scores1) + len(scores2) - 2))
cohens_d = (np.mean(scores1) - np.mean(scores2)) / pooled_std
effect_sizes[f"{model1}_vs_{model2}"] = {
'cohens_d': cohens_d,
'magnitude': self._interpret_effect_size(abs(cohens_d))
}
return effect_sizes
def _interpret_effect_size(self, d):
"""Interpret Cohen's d effect size"""
if d < 0.2:
return "negligible"
elif d < 0.5:
return "small"
elif d < 0.8:
return "medium"
else:
return "large"
def comprehensive_model_comparison(self, models, X, y, cv_strategy,
scoring_metrics=None, n_repeats=5):
"""
Comprehensive comparison of multiple models with multiple metrics
"""
if scoring_metrics is None:
scoring_metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
results = {}
for model_name, model in models.items():
print(f"Evaluating {model_name}...")
model_results = {}
for metric in scoring_metrics:
metric_scores = []
for repeat in range(n_repeats):
# Create fresh CV splitter for each repeat
if hasattr(cv_strategy, 'random_state'):
cv_strategy.random_state = self.random_state + repeat
try:
scores = cross_val_score(model, X, y, cv=cv_strategy,
scoring=metric, n_jobs=-1)
metric_scores.extend(scores)
except Exception as e:
print(f"Error evaluating {model_name} with {metric}: {str(e)}")
metric_scores = [0.0] * cv_strategy.n_splits
model_results[metric] = {
'scores': metric_scores,
'mean': np.mean(metric_scores),
'std': np.std(metric_scores),
'confidence_interval': self._calculate_confidence_interval(metric_scores)
}
results[model_name] = model_results
# Statistical comparisons for each metric
statistical_results = {}
for metric in scoring_metrics:
metric_results = {model: results[model][metric] for model in results.keys()}
statistical_results[metric] = self.statistical_comparison(metric_results)
return results, statistical_results
def _calculate_confidence_interval(self, scores, confidence=0.95):
"""Calculate confidence interval for scores"""
n = len(scores)
mean = np.mean(scores)
std_err = np.std(scores) / np.sqrt(n)
# t-distribution for small samples
from scipy.stats import t
t_value = t.ppf((1 + confidence) / 2, df=n-1)
margin_error = t_value * std_err
return (mean - margin_error, mean + margin_error)
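As a quick sanity check, the helper above can be called directly; the scores below are illustrative.

```python
framework = ModelComparisonFramework()
toy_scores = [0.82, 0.79, 0.84, 0.81, 0.80]
low, high = framework._calculate_confidence_interval(toy_scores)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # a t-based interval around the mean
```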
def create_comparison_report(self, results, statistical_results):
"""Create comprehensive comparison report"""
print("=" * 80)
print("COMPREHENSIVE MODEL COMPARISON REPORT")
print("=" * 80)
# Performance summary
print("\n1. PERFORMANCE SUMMARY")
print("-" * 40)
for model_name, model_results in results.items():
print(f"\n{model_name}:")
for metric, metric_results in model_results.items():
mean_score = metric_results['mean']
std_score = metric_results['std']
ci_lower, ci_upper = metric_results['confidence_interval']
print(f" {metric:15s}: {mean_score:.4f} ± {std_score:.4f} "
f"[{ci_lower:.4f}, {ci_upper:.4f}]")
# Statistical significance
print(f"\n2. STATISTICAL SIGNIFICANCE")
print("-" * 40)
for metric, comparison_matrix in statistical_results.items():
print(f"\n{metric.upper()} - Pairwise p-values (t-test, wilcoxon):")
print(comparison_matrix)
# Recommendations
print(f"\n3. RECOMMENDATIONS")
print("-" * 40)
# Find best model for each metric
best_models = {}
for metric in results[list(results.keys())[0]].keys():
best_model = max(results.keys(),
key=lambda x: results[x][metric]['mean'])
best_score = results[best_model][metric]['mean']
best_models[metric] = (best_model, best_score)
for metric, (best_model, best_score) in best_models.items():
print(f"{metric:15s}: {best_model} ({best_score:.4f})")
return best_models
class CustomMetrics:
"""Custom evaluation metrics for business-specific objectives"""
@staticmethod
def profit_based_score(y_true, y_pred, cost_matrix):
"""
Calculate profit-based score using cost matrix
cost_matrix: dict with keys 'tp', 'fp', 'tn', 'fn' representing
profit/cost for each outcome
"""
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
total_profit = (tp * cost_matrix['tp'] +
fp * cost_matrix['fp'] +
tn * cost_matrix['tn'] +
fn * cost_matrix['fn'])
return total_profit
@staticmethod
def weighted_f1_custom(y_true, y_pred, class_weights):
"""Custom weighted F1 score with business-defined class weights"""
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, support = precision_recall_fscore_support(
y_true, y_pred, average=None
)
weighted_f1 = sum(f1[i] * class_weights.get(i, 1.0) for i in range(len(f1)))
total_weight = sum(class_weights.values())
return weighted_f1 / total_weight
@staticmethod
def top_k_accuracy(y_true, y_pred_proba, k=3):
"""Calculate top-k accuracy for multi-class problems"""
top_k_preds = np.argsort(y_pred_proba, axis=1)[:, -k:]
correct = 0
for i, true_label in enumerate(y_true):
if true_label in top_k_preds[i]:
correct += 1
return correct / len(y_true)
@staticmethod
def regression_within_tolerance(y_true, y_pred, tolerance=0.1):
"""Percentage of predictions within tolerance for regression"""
# Note: assumes y_true contains no zeros; relative error is undefined at zero
relative_errors = np.abs((y_true - y_pred) / y_true)
return (relative_errors <= tolerance).mean()
@staticmethod
def business_impact_score(y_true, y_pred, impact_function):
"""
Generic business impact score using custom impact function
impact_function: function that takes (y_true, y_pred) and returns impact
"""
return impact_function(y_true, y_pred)
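A brief, self-contained check of two of the metrics above; all numbers are illustrative.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
costs = {'tp': 100, 'tn': 10, 'fp': -50, 'fn': -200}
print(CustomMetrics.profit_based_score(y_true, y_pred, costs))  # 80 with these toy arrays

# Top-k accuracy on hypothetical 3-class probabilities with k=2
proba = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
print(CustomMetrics.top_k_accuracy(np.array([2, 0]), proba, k=2))  # 1.0
```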
class ImbalancedDatasetEvaluation:
"""Specialized evaluation for imbalanced datasets"""
def __init__(self, positive_class=1):
self.positive_class = positive_class
def comprehensive_imbalanced_evaluation(self, y_true, y_pred, y_pred_proba=None):
"""Comprehensive evaluation for imbalanced datasets"""
from sklearn.metrics import (precision_recall_curve, average_precision_score,
roc_curve, auc, confusion_matrix, classification_report)
results = {}
# Basic metrics
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
# Calculate metrics manually for clarity
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
results['confusion_matrix'] = cm
results['precision'] = precision
results['recall'] = recall
results['specificity'] = specificity
results['f1_score'] = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
# Balanced accuracy
results['balanced_accuracy'] = (recall + specificity) / 2
# Matthews Correlation Coefficient
mcc_denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
results['mcc'] = ((tp * tn) - (fp * fn)) / mcc_denominator if mcc_denominator > 0 else 0
if y_pred_proba is not None:
# Precision-Recall curve and AUC
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_pred_proba)
results['pr_auc'] = average_precision_score(y_true, y_pred_proba)
results['pr_curve'] = (precision_curve, recall_curve, pr_thresholds)
# ROC curve and AUC
fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)
results['roc_auc'] = auc(fpr, tpr)
results['roc_curve'] = (fpr, tpr, roc_thresholds)
return results
def threshold_optimization(self, y_true, y_pred_proba, optimization_metric='f1'):
"""Optimize classification threshold for imbalanced datasets"""
from sklearn.metrics import precision_recall_curve, roc_curve
precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
if optimization_metric == 'f1':
# Find threshold that maximizes F1 score
f1_scores = 2 * (precision * recall) / (precision + recall)
f1_scores = np.nan_to_num(f1_scores) # Handle division by zero
best_threshold_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_threshold_idx]
best_score = f1_scores[best_threshold_idx]
elif optimization_metric == 'youden_index':
# Youden's J statistic: sensitivity + specificity - 1
fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)
youden_index = tpr - fpr
best_threshold_idx = np.argmax(youden_index)
best_threshold = roc_thresholds[best_threshold_idx]
best_score = youden_index[best_threshold_idx]
elif optimization_metric == 'precision_at_recall':
# Find threshold for specific recall level
target_recall = 0.8 # Can be parameterized
valid_indices = recall >= target_recall
if np.any(valid_indices):
best_precision_idx = np.argmax(precision[valid_indices])
actual_idx = np.where(valid_indices)[0][best_precision_idx]
best_threshold = thresholds[actual_idx]
best_score = precision[actual_idx]
else:
best_threshold = 0.5
best_score = 0.0
else:
raise ValueError(f"Unknown optimization_metric: {optimization_metric}")
return best_threshold, best_score
def cost_sensitive_evaluation(self, y_true, y_pred_proba, cost_matrix):
"""
Find optimal threshold based on cost matrix
cost_matrix: dict with 'tp', 'fp', 'tn', 'fn' costs
"""
from sklearn.metrics import confusion_matrix
thresholds = np.linspace(0.01, 0.99, 100)
costs = []
for threshold in thresholds:
y_pred_thresh = (y_pred_proba >= threshold).astype(int)
cm = confusion_matrix(y_true, y_pred_thresh)
if cm.shape == (2, 2):
tn, fp, fn, tp = cm.ravel()
total_cost = (tp * cost_matrix['tp'] +
fp * cost_matrix['fp'] +
tn * cost_matrix['tn'] +
fn * cost_matrix['fn'])
costs.append(total_cost)
else:
costs.append(float('inf'))
best_threshold_idx = np.argmin(costs)
best_threshold = thresholds[best_threshold_idx]
best_cost = costs[best_threshold_idx]
return best_threshold, best_cost, thresholds, costs
class TimeSeriesEvaluation:
"""Specialized evaluation methods for time series models"""
def __init__(self):
self.evaluation_results = {}
def walk_forward_validation(self, model, X, y, time_column,
initial_train_size=None, step_size=1,
forecast_horizon=1):
"""
Implement walk-forward validation for time series
"""
# Sort by time
time_sorted = X.sort_values(time_column)
X_sorted = time_sorted.drop(columns=[time_column])
y_sorted = y.loc[time_sorted.index]
n_samples = len(X_sorted)
if initial_train_size is None:
initial_train_size = n_samples // 3
predictions = []
actuals = []
train_sizes = []
for start_idx in range(initial_train_size,
n_samples - forecast_horizon + 1,
step_size):
# Training data: from beginning to current point
X_train = X_sorted.iloc[:start_idx]
y_train = y_sorted.iloc[:start_idx]
# Test data: next forecast_horizon points
X_test = X_sorted.iloc[start_idx:start_idx + forecast_horizon]
y_test = y_sorted.iloc[start_idx:start_idx + forecast_horizon]
# Train and predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions.extend(y_pred)
actuals.extend(y_test.values)
train_sizes.append(len(X_train))
return np.array(predictions), np.array(actuals), train_sizes
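A hedged usage sketch of walk_forward_validation on a synthetic series; the feature construction and parameter values here are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 60
df = pd.DataFrame({
    'time': pd.date_range('2024-01-01', periods=n, freq='D'),
    'lag_1': rng.normal(size=n),
})
target = 2.0 * df['lag_1'] + rng.normal(scale=0.1, size=n)

ts_eval = TimeSeriesEvaluation()
preds, actuals, sizes = ts_eval.walk_forward_validation(
    LinearRegression(), df, target, time_column='time',
    initial_train_size=30, step_size=5, forecast_horizon=1
)
print(len(preds), 'forecasts; final training size =', sizes[-1])
```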
def time_series_cv_with_gap(self, model, X, y, time_column,
n_splits=5, gap_size=0, test_size=None):
"""
Time series cross-validation with gap between train and test
"""
# Sort by time
time_sorted = X.sort_values(time_column)
X_sorted = time_sorted.drop(columns=[time_column])
y_sorted = y.loc[time_sorted.index]
n_samples = len(X_sorted)
if test_size is None:
test_size = n_samples // (n_splits + 1)
fold_results = []
for i in range(n_splits):
# Calculate split points
test_start = (i + 1) * test_size + i * gap_size
test_end = test_start + test_size
train_end = test_start - gap_size
if test_end > n_samples or train_end <= 0:
continue
# Split data
X_train = X_sorted.iloc[:train_end]
y_train = y_sorted.iloc[:train_end]
X_test = X_sorted.iloc[test_start:test_end]
y_test = y_sorted.iloc[test_start:test_end]
# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate metrics
fold_metrics = self._calculate_ts_metrics(y_test.values, y_pred)
fold_results.append(fold_metrics)
return fold_results
def _calculate_ts_metrics(self, y_true, y_pred):
"""Calculate time series specific metrics"""
from sklearn.metrics import mean_squared_error, mean_absolute_error
metrics = {}
# Standard regression metrics
metrics['mse'] = mean_squared_error(y_true, y_pred)
metrics['rmse'] = np.sqrt(metrics['mse'])
metrics['mae'] = mean_absolute_error(y_true, y_pred)
# Time series specific metrics
# Mean Absolute Percentage Error
metrics['mape'] = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
# Symmetric MAPE (handles zero values better)
metrics['smape'] = np.mean(2 * np.abs(y_pred - y_true) /
(np.abs(y_true) + np.abs(y_pred))) * 100
# Directional accuracy (for trend prediction)
if len(y_true) > 1:
true_direction = np.sign(np.diff(y_true))
pred_direction = np.sign(np.diff(y_pred))
metrics['directional_accuracy'] = np.mean(true_direction == pred_direction)
else:
metrics['directional_accuracy'] = np.nan
return metrics
def residual_analysis(self, y_true, y_pred, time_index=None):
"""Comprehensive residual analysis for time series"""
residuals = y_true - y_pred
analysis = {}
# Basic statistics
analysis['mean_residual'] = np.mean(residuals)
analysis['std_residual'] = np.std(residuals)
analysis['skewness'] = stats.skew(residuals)
analysis['kurtosis'] = stats.kurtosis(residuals)
# Normality test
_, analysis['normality_pvalue'] = stats.jarque_bera(residuals)
# Autocorrelation test (if time index provided)
if time_index is not None:
# Ljung-Box test for autocorrelation
from statsmodels.stats.diagnostic import acorr_ljungbox
lb_result = acorr_ljungbox(residuals, lags=min(10, len(residuals)//4))  # returns a DataFrame in statsmodels >= 0.12
analysis['ljung_box_pvalue'] = lb_result['lb_pvalue'].iloc[-1]  # p-value at the largest lag
# Heteroscedasticity test
if len(y_pred) > 10:
# Simple heteroscedasticity check: correlation between |residuals| and
# predictions (a lightweight stand-in for a formal Breusch-Pagan test)
from scipy.stats import pearsonr
_, analysis['heteroscedasticity_pvalue'] = pearsonr(np.abs(residuals), y_pred)
return analysis, residuals
def forecast_evaluation_metrics(self, y_true, y_pred, seasonal_period=None):
"""
Advanced forecast evaluation metrics
"""
metrics = {}
# Standard metrics
metrics.update(self._calculate_ts_metrics(y_true, y_pred))
# Forecast skill metrics
if seasonal_period is not None:
# Seasonal naive forecast for comparison
seasonal_naive = np.roll(y_true, seasonal_period)[:len(y_pred)]
seasonal_naive[:seasonal_period] = y_true[:seasonal_period] # Fill initial values
from sklearn.metrics import mean_squared_error
naive_mse = mean_squared_error(y_true, seasonal_naive)
model_mse = mean_squared_error(y_true, y_pred)
# Forecast skill score
metrics['forecast_skill'] = 1 - (model_mse / naive_mse)
# Prediction interval coverage (if available)
# This would require prediction intervals from the model
return metrics
class AutomatedModelSelection:
"""Automated model selection and hyperparameter optimization pipeline"""
def __init__(self, random_state=42, n_jobs=-1):
self.random_state = random_state
self.n_jobs = n_jobs
self.results = {}
self.best_model = None
self.selection_history = []
def define_search_space(self, problem_type='classification'):
"""Define comprehensive search space for different problem types"""
if problem_type == 'classification':
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb
search_space = {
'logistic_regression': {
'model': LogisticRegression(random_state=self.random_state),
'params': {
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['liblinear', 'saga']
}
},
'random_forest': {
'model': RandomForestClassifier(random_state=self.random_state),
'params': {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
},
'gradient_boosting': {
'model': GradientBoostingClassifier(random_state=self.random_state),
'params': {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
'subsample': [0.8, 0.9, 1.0]
}
},
'xgboost': {
'model': xgb.XGBClassifier(random_state=self.random_state),
'params': {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7],
'subsample': [0.8, 0.9, 1.0]
}
}
}
elif problem_type == 'regression':
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
import xgboost as xgb
search_space = {
'linear_regression': {
'model': LinearRegression(),
'params': {}
},
'ridge_regression': {
'model': Ridge(random_state=self.random_state),
'params': {
'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}
},
'lasso_regression': {
'model': Lasso(random_state=self.random_state),
'params': {
'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}
},
'random_forest': {
'model': RandomForestRegressor(random_state=self.random_state),
'params': {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
},
'xgboost': {
'model': xgb.XGBRegressor(random_state=self.random_state),
'params': {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7]
}
}
}
return search_space
def progressive_model_selection(self, X, y, problem_type='classification',
cv_strategy=None, scoring=None,
max_iterations=10, early_stopping_rounds=3):
"""
Progressive model selection with early stopping
"""
from sklearn.model_selection import RandomizedSearchCV
if cv_strategy is None:
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
if scoring is None:
scoring = 'accuracy' if problem_type == 'classification' else 'r2'
search_space = self.define_search_space(problem_type)
model_scores = []
no_improvement_count = 0
best_score = -np.inf
print("Starting progressive model selection...")
for iteration in range(max_iterations):
print(f"\nIteration {iteration + 1}/{max_iterations}")
# Select models to evaluate (can implement smart selection here)
models_to_evaluate = list(search_space.keys())
iteration_results = {}
for model_name in models_to_evaluate:
model_config = search_space[model_name]
if model_config['params']:
# Randomized search for hyperparameters
search = RandomizedSearchCV(
model_config['model'],
model_config['params'],
n_iter=20, # Reduced for progressive approach
cv=cv_strategy,
scoring=scoring,
n_jobs=self.n_jobs,
random_state=self.random_state + iteration
)
search.fit(X, y)
best_model = search.best_estimator_
best_score_iter = search.best_score_
else:
# No hyperparameters to tune
model_config['model'].fit(X, y)
scores = cross_val_score(model_config['model'], X, y,
cv=cv_strategy, scoring=scoring)
best_score_iter = scores.mean()
best_model = model_config['model']
iteration_results[model_name] = {
'score': best_score_iter,
'model': best_model
}
print(f" {model_name}: {best_score_iter:.4f}")
# Find best model in this iteration
best_model_name = max(iteration_results.keys(),
key=lambda x: iteration_results[x]['score'])
iter_best_score = iteration_results[best_model_name]['score']
model_scores.append(iter_best_score)
# Check for improvement
if iter_best_score > best_score:
best_score = iter_best_score
self.best_model = iteration_results[best_model_name]['model']
no_improvement_count = 0
print(f" New best score: {best_score:.4f} ({best_model_name})")
else:
no_improvement_count += 1
print(f" No improvement for {no_improvement_count} iterations")
# Early stopping
if no_improvement_count >= early_stopping_rounds:
print(f" Early stopping after {early_stopping_rounds} iterations without improvement")
break
# Update search space based on results (adaptive approach)
self._update_search_space(search_space, iteration_results)
self.results['progressive_selection'] = {
'scores_by_iteration': model_scores,
'best_score': best_score,
'best_model': self.best_model,
'total_iterations': iteration + 1
}
return self.best_model, best_score
def _update_search_space(self, search_space, iteration_results):
"""Update search space based on iteration results (simplified version)"""
# This is a placeholder for adaptive search space modification
# In practice, you might:
# 1. Focus on promising model types
# 2. Narrow hyperparameter ranges around good values
# 3. Add new models based on ensemble opportunities
pass
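The placeholder above leaves the adaptive step open. One simple, hypothetical realization is sketched below: drop the weakest model family each iteration so later iterations concentrate their search budget on promising families. Note that, unlike the in-place stub, this sketch returns a new dict; the helper name prune_weakest_family is ours, not part of any library.

```python
def prune_weakest_family(search_space, iteration_results, keep_at_least=2):
    """Hypothetical adaptive update: remove the worst-scoring model family."""
    if len(search_space) <= keep_at_least:
        return search_space  # keep a minimum of model diversity
    worst = min(iteration_results, key=lambda name: iteration_results[name]['score'])
    pruned = dict(search_space)
    pruned.pop(worst, None)
    return pruned
```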
def bayesian_optimization_selection(self, X, y, problem_type='classification',
cv_strategy=None, n_calls=50):
"""
Model selection using Bayesian optimization
Requires scikit-optimize: pip install scikit-optimize
"""
try:
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical
from skopt.utils import use_named_args
except ImportError:
print("scikit-optimize not available. Install with: pip install scikit-optimize")
return None, None
if cv_strategy is None:
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
# Define search space for Bayesian optimization
search_dimensions = [
Categorical(['random_forest', 'gradient_boosting', 'xgboost'], name='model_type'),
Integer(50, 200, name='n_estimators'),
Real(0.01, 0.3, name='learning_rate'),
Integer(3, 10, name='max_depth'),
Real(0.1, 1.0, name='subsample')
]
@use_named_args(search_dimensions)
def objective(**params):
# Create model based on parameters
model_type = params['model_type']
if model_type == 'random_forest':
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
ModelClass = RandomForestClassifier if problem_type == 'classification' else RandomForestRegressor
model = ModelClass(
n_estimators=params['n_estimators'],
max_depth=params['max_depth'],
random_state=self.random_state
)
elif model_type == 'gradient_boosting':
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
ModelClass = GradientBoostingClassifier if problem_type == 'classification' else GradientBoostingRegressor
model = ModelClass(
n_estimators=params['n_estimators'],
learning_rate=params['learning_rate'],
max_depth=params['max_depth'],
subsample=params['subsample'],
random_state=self.random_state
)
else: # xgboost
import xgboost as xgb
ModelClass = xgb.XGBClassifier if problem_type == 'classification' else xgb.XGBRegressor
model = ModelClass(
n_estimators=params['n_estimators'],
learning_rate=params['learning_rate'],
max_depth=params['max_depth'],
subsample=params['subsample'],
random_state=self.random_state
)
# Evaluate model
scoring = 'accuracy' if problem_type == 'classification' else 'r2'
scores = cross_val_score(model, X, y, cv=cv_strategy, scoring=scoring)
# Return negative score for minimization
return -scores.mean()
# Run Bayesian optimization
print("Running Bayesian optimization...")
result = gp_minimize(objective, search_dimensions, n_calls=n_calls,
random_state=self.random_state)
# Extract best parameters and create best model
best_params = dict(zip([dim.name for dim in search_dimensions], result.x))
print(f"Best parameters: {best_params}")
print(f"Best score: {-result.fun:.4f}")
# Create and return best model
# ... (implementation similar to objective function)
return result, best_params
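The elided rebuild step can be completed along the lines below. This is a sketch under the assumption of a classification problem, mirroring the dispatch inside objective(); the helper name build_from_params is ours, not part of scikit-optimize.

```python
def build_from_params(best_params, random_state=42):
    """Rebuild an estimator from the parameters found by gp_minimize."""
    model_type = best_params['model_type']
    if model_type == 'random_forest':
        from sklearn.ensemble import RandomForestClassifier
        return RandomForestClassifier(
            n_estimators=best_params['n_estimators'],
            max_depth=best_params['max_depth'],
            random_state=random_state)
    if model_type == 'gradient_boosting':
        from sklearn.ensemble import GradientBoostingClassifier
        return GradientBoostingClassifier(
            n_estimators=best_params['n_estimators'],
            learning_rate=best_params['learning_rate'],
            max_depth=best_params['max_depth'],
            subsample=best_params['subsample'],
            random_state=random_state)
    import xgboost as xgb  # remaining case: xgboost
    return xgb.XGBClassifier(
        n_estimators=best_params['n_estimators'],
        learning_rate=best_params['learning_rate'],
        max_depth=best_params['max_depth'],
        subsample=best_params['subsample'],
        random_state=random_state)
```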
def ensemble_model_selection(self, X, y, base_models, cv_strategy=None,
ensemble_methods=['voting', 'stacking']):
"""
Evaluate ensemble methods with selected base models
"""
from sklearn.ensemble import VotingClassifier, VotingRegressor
from sklearn.model_selection import cross_val_score
if cv_strategy is None:
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=self.random_state)
ensemble_results = {}
# Voting ensemble
if 'voting' in ensemble_methods:
problem_type = 'classification' if hasattr(base_models[0][1], 'predict_proba') else 'regression'
if problem_type == 'classification':
voting_ensemble = VotingClassifier(base_models, voting='soft')
scoring = 'accuracy'
else:
voting_ensemble = VotingRegressor(base_models)
scoring = 'r2'
voting_scores = cross_val_score(voting_ensemble, X, y, cv=cv_strategy, scoring=scoring)
ensemble_results['voting'] = {
'scores': voting_scores,
'mean_score': voting_scores.mean(),
'std_score': voting_scores.std(),
'model': voting_ensemble
}
print(f"Voting ensemble score: {voting_scores.mean():.4f} ± {voting_scores.std():.4f}")
# Stacking ensemble
if 'stacking' in ensemble_methods:
from sklearn.ensemble import StackingClassifier, StackingRegressor
from sklearn.linear_model import LogisticRegression, Ridge
problem_type = 'classification' if hasattr(base_models[0][1], 'predict_proba') else 'regression'
if problem_type == 'classification':
meta_learner = LogisticRegression(random_state=self.random_state)
stacking_ensemble = StackingClassifier(base_models, final_estimator=meta_learner,
cv=3, n_jobs=self.n_jobs)
scoring = 'accuracy'
else:
meta_learner = Ridge(random_state=self.random_state)
stacking_ensemble = StackingRegressor(base_models, final_estimator=meta_learner,
cv=3, n_jobs=self.n_jobs)
scoring = 'r2'
stacking_scores = cross_val_score(stacking_ensemble, X, y, cv=cv_strategy, scoring=scoring)
ensemble_results['stacking'] = {
'scores': stacking_scores,
'mean_score': stacking_scores.mean(),
'std_score': stacking_scores.std(),
'model': stacking_ensemble
}
print(f"Stacking ensemble score: {stacking_scores.mean():.4f} ± {stacking_scores.std():.4f}")
return ensemble_results
def comprehensive_model_evaluation_lab():
"""
Comprehensive lab for advanced model evaluation techniques
"""
print("ADVANCED MODEL EVALUATION LAB")
print("=" * 50)
# Generate sample dataset for demonstration
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, n_clusters_per_class=1,
class_sep=0.8, random_state=42)
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_series = pd.Series(y)
print(f"Dataset created: {X_df.shape[0]} samples, {X_df.shape[1]} features")
print(f"Class distribution: {pd.Series(y).value_counts().to_dict()}")
# 1. Advanced Cross-Validation
print("\n1. ADVANCED CROSS-VALIDATION")
print("-" * 30)
cv_framework = AdvancedCrossValidation()
# Define CV strategies
cv_strategies = {
'standard_kfold': KFold(n_splits=5, shuffle=True, random_state=42),
'stratified_kfold': StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
}
# Evaluate CV stability
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
stability_results = cv_framework.evaluate_cv_stability(
rf_model, X_df, y_series, cv_strategies, n_repeats=3
)
print("CV Stability Results:")
for strategy, results in stability_results.items():
print(f" {strategy}: Mean={results['mean']:.4f}, CV={results['cv']:.4f}")
# 2. Nested Cross-Validation
print("\n2. NESTED CROSS-VALIDATION")
print("-" * 30)
param_grid = {
'n_estimators': [50, 100, 150],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
nested_results = cv_framework.nested_cross_validation(
RandomForestClassifier(random_state=42),
param_grid, X_df, y_series,
outer_cv=3, inner_cv=3
)
# 3. Comprehensive Model Comparison
print("\n3. COMPREHENSIVE MODEL COMPARISON")
print("-" * 30)
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'Logistic Regression': LogisticRegression(random_state=42),
'SVM': SVC(random_state=42, probability=True)
}
comparison_framework = ModelComparisonFramework()
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results, statistical_results = comparison_framework.comprehensive_model_comparison(
models, X_df, y_series, cv_strategy,
scoring_metrics=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'],
n_repeats=2
)
# Generate comparison report
best_models = comparison_framework.create_comparison_report(results, statistical_results)
# 4. Custom Business Metrics
print("\n4. CUSTOM BUSINESS METRICS")
print("-" * 30)
# Example: Cost-sensitive evaluation
cost_matrix = {
'tp': 100, # Revenue from correctly identifying positive
'tn': 10, # Cost savings from correctly identifying negative
'fp': -50, # Cost of false positive
'fn': -200 # Cost of missing positive
}
# Train best model and evaluate with custom metric
# (illustration only: scoring on the training data overstates real-world profit)
best_model = models['Random Forest']
best_model.fit(X_df, y_series)
y_pred = best_model.predict(X_df)
profit_score = CustomMetrics.profit_based_score(y_series, y_pred, cost_matrix)
print(f"Profit-based score: ${profit_score:,.0f}")
# 5. Automated Model Selection
print("\n5. AUTOMATED MODEL SELECTION")
print("-" * 30)
auto_selector = AutomatedModelSelection(random_state=42)
# Progressive selection
best_model_prog, best_score_prog = auto_selector.progressive_model_selection(
X_df, y_series, problem_type='classification',
max_iterations=3, early_stopping_rounds=2
)
print(f"Progressive selection - Best score: {best_score_prog:.4f}")
return {
'nested_cv_results': nested_results,
'model_comparison': results,
'statistical_comparison': statistical_results,
'best_models': best_models,
'automated_selection': auto_selector.results
}
# Run the comprehensive lab
if __name__ == "__main__":
lab_results = comprehensive_model_evaluation_lab()
## 9.6 Model Evaluation Best Practices and Common Pitfalls
### 9.6.1 Best Practices Checklist
```python
class ModelEvaluationBestPractices:
"""Comprehensive checklist and guidelines for model evaluation"""
@staticmethod
def evaluation_checklist():
"""Complete evaluation checklist for ML projects"""
checklist = {
"Data Splitting": [
"✅ Hold-out test set never used for model development",
"✅ Stratified splitting for imbalanced datasets",
"✅ Time-based splitting for temporal data",
"✅ Group-aware splitting when necessary",
"✅ Consistent random seeds for reproducibility"
],
"Cross-Validation": [
"✅ Appropriate CV strategy for data type",
"✅ Sufficient number of folds (typically 5-10)",
"✅ Nested CV for hyperparameter optimization",
"✅ Statistical significance testing between models",
"✅ Stability analysis across CV folds"
],
"Metric Selection": [
"✅ Metrics aligned with business objectives",
"✅ Multiple complementary metrics used",
"✅ Appropriate metrics for class imbalance",
"✅ Domain-specific metrics when applicable",
"✅ Confidence intervals reported"
],
"Model Comparison": [
"✅ Statistical significance testing performed",
"✅ Effect size analysis conducted",
"✅ Computational cost considered",
"✅ Interpretability requirements addressed",
"✅ Robustness analysis completed"
],
"Validation": [
"✅ Out-of-time validation for temporal data",
"✅ Out-of-sample validation on different populations",
"✅ Adversarial testing performed",
"✅ Performance monitoring plan established",
"✅ Model degradation thresholds defined"
]
}
return checklist
@staticmethod
def common_pitfalls():
"""Common pitfalls in model evaluation and how to avoid them"""
pitfalls = {
"Data Leakage": {
"description": "Information from the future or target leaking into features",
"examples": [
"Using statistics calculated on entire dataset before splitting",
"Including features derived from target variable",
"Using future information in time series models"
],
"prevention": [
"Always split data before any preprocessing",
"Careful feature engineering review",
"Time-aware validation for temporal data"
]
},
"Overfitting to Validation Set": {
"description": "Repeated model selection on same validation set",
"examples": [
"Multiple rounds of hyperparameter tuning on same validation set",
"Model selection based on validation performance only",
"Extensive feature selection using validation performance"
],
"prevention": [
"Use nested cross-validation",
"Hold-out final test set",
"Limit validation set usage"
]
},
"Inappropriate Metrics": {
"description": "Using metrics not suitable for the problem or business context",
"examples": [
"Using accuracy for highly imbalanced datasets",
"Ignoring class costs in business applications",
"Single metric evaluation for complex problems"
],
"prevention": [
"Understand business context and costs",
"Use multiple complementary metrics",
"Consider class imbalance and costs"
]
},
"Statistical Issues": {
"description": "Improper statistical analysis of results",
"examples": [
"Comparing models without significance testing",
"Ignoring multiple testing corrections",
"Assuming normal distribution of performance metrics"
],
"prevention": [
"Use appropriate statistical tests",
"Apply multiple testing corrections",
"Report confidence intervals"
]
}
}
return pitfalls
@staticmethod
def generate_evaluation_report(model_results, test_results, business_context):
"""Generate comprehensive evaluation report template"""
report_template = f"""
# Model Evaluation Report
## Executive Summary
- **Best Model**: {test_results.get('best_model_name', 'TBD')}
- **Performance**: {test_results.get('best_score', float('nan')):.4f}
- **Business Impact**: {business_context.get('expected_impact', 'TBD')}
- **Recommendation**: {business_context.get('recommendation', 'TBD')}
## Model Performance Summary
### Cross-Validation Results
{ModelEvaluationBestPractices._format_cv_results(model_results)}
### Test Set Results
{ModelEvaluationBestPractices._format_test_results(test_results)}
### Statistical Significance
- Significance tests performed: Yes/No
- P-values: [Details]
- Effect sizes: [Details]
## Business Context Analysis
### Performance Requirements
- Minimum acceptable performance: {business_context.get('min_performance', 'TBD')}
- Current model meets requirements: Yes/No
- Performance vs. business metrics alignment: [Analysis]
### Implementation Considerations
- Computational requirements: [Details]
- Interpretability needs: [Assessment]
- Deployment constraints: [List]
- Monitoring plan: [Strategy]
## Risk Assessment
### Model Risks
- Overfitting risk: Low/Medium/High
- Generalization concerns: [Details]
- Bias/fairness issues: [Assessment]
- Robustness analysis: [Results]
### Mitigation Strategies
- [List of mitigation approaches]
## Recommendations
### Model Selection
- Primary recommendation: [Model + justification]
- Alternative options: [Backup models]
- Ensemble considerations: [Analysis]
### Next Steps
1. [Action item 1]
2. [Action item 2]
3. [Action item 3]
## Appendices
### A. Detailed Performance Metrics
[Comprehensive metrics table]
### B. Statistical Analysis
[Detailed statistical results]
### C. Code and Reproducibility
[Implementation details and reproduction instructions]
"""
return report_template
@staticmethod
def _format_cv_results(results):
"""Format cross-validation results for report"""
# Placeholder - would format actual results
return "Cross-validation results formatted here"
@staticmethod
def _format_test_results(results):
"""Format test results for report"""
# Placeholder - would format actual results
return "Test results formatted here"
class PerformanceMonitoringStrategy:
"""Strategy for monitoring model performance in production"""
def __init__(self, model_name, performance_thresholds):
self.model_name = model_name
self.performance_thresholds = performance_thresholds
self.monitoring_history = []
def setup_monitoring_framework(self):
"""Setup comprehensive monitoring framework"""
monitoring_components = {
"Data Quality Monitoring": {
"metrics": ["missing_value_rate", "data_drift_score", "feature_distribution_change"],
"thresholds": {"missing_value_rate": 0.05, "drift_score": 0.1},
"frequency": "daily"
},
"Performance Monitoring": {
"metrics": ["accuracy", "precision", "recall", "f1_score"],
"thresholds": self.performance_thresholds,
"frequency": "weekly"
},
"Business Metrics Monitoring": {
"metrics": ["conversion_rate", "revenue_impact", "cost_savings"],
"thresholds": {"conversion_rate": 0.02, "revenue_impact": 0.05},
"frequency": "monthly"
},
"Model Behavior Monitoring": {
"metrics": ["prediction_distribution", "confidence_scores", "feature_importance_drift"],
"thresholds": {"prediction_drift": 0.1, "confidence_threshold": 0.7},
"frequency": "daily"
}
}
return monitoring_components
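# --- Hedged sketch: one way to compute the "data_drift_score" referenced
# above, via the Population Stability Index (PSI) on a single feature.
# Bin edges come from the reference window; counts are smoothed to avoid
# log(0), and values outside the reference range fall out of the bins
# (a known simplification). Thresholds such as 0.1 (moderate drift) and
# 0.25 (major drift) are conventions, not hard rules.
def population_stability_index(reference, current, bins=10):
    import numpy as np
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    ref_pct = (ref_counts + 1e-6) / (ref_counts.sum() + 1e-6 * bins)
    cur_pct = (cur_counts + 1e-6) / (cur_counts.sum() + 1e-6 * bins)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))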
def define_alerting_rules(self):
"""Define alerting rules for different scenarios"""
alerting_rules = {
"Critical Alerts": {
"triggers": [
"Performance drops below minimum threshold",
"Data pipeline failure",
"Model prediction errors spike"
],
"response_time": "immediate",
"escalation": "on-call engineer + ML team lead"
},
"Warning Alerts": {
"triggers": [
"Performance declining trend",
"Data drift detected",
"Unusual prediction patterns"
],
"response_time": "within 4 hours",
"escalation": "ML team"
},
"Info Alerts": {
"triggers": [
"Weekly performance report",
"Monthly model review due",
"Scheduled retraining recommended"
],
"response_time": "next business day",
"escalation": "model owner"
}
}
return alerting_rules
def retraining_strategy(self):
"""Define model retraining strategy"""
retraining_strategy = {
"Scheduled Retraining": {
"frequency": "monthly",
"triggers": ["calendar_schedule"],
"validation_required": True
},
"Performance-Based Retraining": {
"frequency": "as_needed",
"triggers": ["performance_degradation", "data_drift"],
"validation_required": True
},
"Emergency Retraining": {
"frequency": "immediate",
"triggers": ["critical_performance_drop", "data_quality_issues"],
"validation_required": True,
"rollback_plan": True
}
}
return retraining_strategy
```
This chapter provided comprehensive coverage of advanced model selection and evaluation techniques essential for building robust, production-ready machine learning systems, and it serves as a practical guide for implementing evaluation processes that support reliable model selection and sustained performance in production environments.
### Exercises
1. Implement a custom cross-validation strategy for a specific domain (e.g., medical diagnosis with patient groups, financial time series with market regimes). Include:
   - Custom splitting logic
   - Appropriate evaluation metrics
   - Statistical significance testing
2. Design a complete evaluation framework for a specific business problem:
   - Define custom business metrics
   - Implement cost-sensitive evaluation
   - Create decision threshold optimization
   - Develop ROI analysis
3. Build an automated model selection pipeline that includes:
   - Progressive model selection with early stopping
   - Bayesian optimization for hyperparameter tuning
   - Ensemble method evaluation
   - Statistical comparison and reporting
4. Conduct a comprehensive model comparison study:
   - Compare at least five different algorithms
   - Use multiple evaluation metrics
   - Perform statistical significance testing
   - Analyze effect sizes and practical significance
   - Create a detailed comparison report
5. Design and implement a production model monitoring system:
   - Performance degradation detection
   - Data drift monitoring
   - Automated alerting rules
   - Retraining trigger mechanisms
   - Performance dashboard creation
# Chapter 10: The Guardian’s Oath - Ethics and Deployment in the Age of AI
By the end of this chapter, you will have evolved from a technical practitioner to a Guardian of Algorithmic Justice:
- Recognize and heal the hidden wounds of bias that algorithms inherit from human history
- Architect fairness-aware systems that embody our highest aspirations for equity and justice
- Design transparent AI that invites trust rather than demanding blind faith
- Deploy intelligent systems with the wisdom of a seasoned guardian, anticipating risks before they manifest
- Build governance frameworks that ensure AI remains humanity’s servant, not its master
- Navigate the complex landscape of AI regulation with both compliance expertise and ethical intuition
- Master the art of explainable AI: making the invisible visible, the complex comprehensible
“With great power comes great responsibility.” - Uncle Ben (and every AI practitioner worth their salt)
Welcome to the most important chapter of your machine learning journey, where technical excellence meets moral imperative, where algorithmic power encounters human wisdom, and where your code becomes a reflection of your values and your vision for the future.
This is not just another technical chapter. This is your oath-taking ceremony as a guardian of one of humanity’s most powerful technologies. Every line of code you write from this moment forward carries the potential to uplift or oppress, to illuminate or obscure, to connect or divide.
Imagine you’re standing at the threshold of a new era: one where algorithms help doctors diagnose diseases, judges determine sentences, employers make hiring decisions, and financial institutions approve loans. In this brave new world, you are not just a programmer; you are an architect of society’s digital infrastructure.
The models you build will touch millions of lives in ways both seen and unseen. The biases you fail to address will echo through generations. The fairness you embed will become tomorrow’s justice. The transparency you provide will determine whether AI becomes humanity’s greatest tool or its most dangerous black box.
This chapter is your transformation story, from technical practitioner to ethical guardian:
- The Bias Hunter: Learning to see the invisible prejudices that hide in data and algorithms
- 🛡️ The Fairness Architect: Building systems that actively promote equity rather than merely avoiding obvious discrimination
- 🔍 The Transparency Wizard: Making black boxes into glass houses where every decision can be understood and questioned
- The Deployment Sage: Launching AI systems with the wisdom to anticipate failure modes and the humility to monitor for unintended consequences
- The Governance Craftsperson: Creating frameworks that ensure AI remains accountable to human values
- 🌍 The Future Guardian: Preparing for tomorrow’s challenges while addressing today’s responsibilities
This isn’t just about following guidelines or checking compliance boxes: it’s about developing the ethical intuition that will guide you through unprecedented decisions in a rapidly evolving field. You’ll learn to ask not just “Can we build this?” but “Should we build this?” and “How do we build this responsibly?”
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
class BiasDetectionFramework:
"""Framework for detecting and analyzing bias in ML systems"""
def __init__(self):
self.bias_metrics = {}
self.fairness_metrics = {}
def generate_biased_dataset(self, n_samples=10000):
"""Generate a dataset with realistic bias patterns for demonstration"""
np.random.seed(42)
# Protected attributes
gender = np.random.choice(['Male', 'Female'], n_samples, p=[0.6, 0.4])
race = np.random.choice(['White', 'Black', 'Hispanic', 'Asian'],
n_samples, p=[0.6, 0.2, 0.15, 0.05])
age = np.random.normal(40, 12, n_samples)
age = np.clip(age, 18, 70).astype(int)
# Correlated features (introducing bias)
education_bias = {'Male': 0.7, 'Female': 0.5} # Gender bias in education
race_bias = {'White': 0.8, 'Black': 0.4, 'Hispanic': 0.5, 'Asian': 0.9}
education_level = []
for i in range(n_samples):
base_prob = education_bias[gender[i]] * race_bias[race[i]]
# Add age effect
age_factor = 1.0 if age[i] > 25 else 0.7
final_prob = base_prob * age_factor
education_level.append(np.random.choice(['High School', 'Bachelor', 'Masters', 'PhD'],
p=self._normalize_education_probs(final_prob)))
# Work experience (biased by gender and race)
experience_years = []
for i in range(n_samples):
base_exp = max(0, age[i] - 22) # Start working at 22
# Gender bias in career progression
gender_penalty = 0.8 if gender[i] == 'Female' else 1.0
# Race bias in opportunities
race_multiplier = {'White': 1.0, 'Black': 0.7, 'Hispanic': 0.8, 'Asian': 0.95}
final_exp = base_exp * gender_penalty * race_multiplier[race[i]]
experience_years.append(max(0, int(final_exp + np.random.normal(0, 2))))
# Salary (outcome variable with embedded bias)
salary = []
education_salary_map = {'High School': 40000, 'Bachelor': 60000,
'Masters': 80000, 'PhD': 100000}
for i in range(n_samples):
base_salary = education_salary_map[education_level[i]]
# Experience bonus
exp_bonus = experience_years[i] * 1500
# Gender pay gap
gender_multiplier = 0.82 if gender[i] == 'Female' else 1.0
# Race-based salary discrimination
race_salary_multiplier = {'White': 1.0, 'Black': 0.85, 'Hispanic': 0.88, 'Asian': 1.05}
final_salary = (base_salary + exp_bonus) * gender_multiplier * race_salary_multiplier[race[i]]
salary.append(int(final_salary + np.random.normal(0, 5000)))
# Binary outcome: High performer (biased selection)
high_performer = []
for i in range(n_samples):
# Base probability from salary and experience
base_prob = min(0.8, (salary[i] / 120000) * 0.5 + (experience_years[i] / 20) * 0.3)
# Add bias in performance evaluation
gender_bias_perf = 0.9 if gender[i] == 'Female' else 1.0 # Harder standards for women
race_bias_perf = {'White': 1.0, 'Black': 0.8, 'Hispanic': 0.85, 'Asian': 1.1}
final_prob = base_prob * gender_bias_perf * race_bias_perf[race[i]]
high_performer.append(np.random.binomial(1, min(0.9, final_prob)))
# Create DataFrame
df = pd.DataFrame({
'gender': gender,
'race': race,
'age': age,
'education_level': education_level,
'experience_years': experience_years,
'salary': salary,
'high_performer': high_performer
})
return df
def _normalize_education_probs(self, base_prob):
"""Normalize education level probabilities"""
if base_prob > 0.8:
return [0.1, 0.3, 0.4, 0.2] # Higher education
elif base_prob > 0.6:
return [0.2, 0.4, 0.3, 0.1] # Medium education
elif base_prob > 0.4:
return [0.4, 0.4, 0.15, 0.05] # Lower-medium education
else:
return [0.6, 0.3, 0.08, 0.02] # Lower education
def detect_statistical_parity_bias(self, y_true, y_pred, protected_attribute):
"""Detect statistical parity violations"""
results = {}
for group in protected_attribute.unique():
group_mask = protected_attribute == group
group_positive_rate = y_pred[group_mask].mean()
results[group] = group_positive_rate
# Calculate disparate impact
majority_group = max(results.keys(), key=lambda x: (protected_attribute == x).sum())
minority_groups = [g for g in results.keys() if g != majority_group]
disparate_impacts = {}
for minority_group in minority_groups:
if results[majority_group] > 0:
impact = results[minority_group] / results[majority_group]
disparate_impacts[minority_group] = impact
return {
'positive_rates': results,
'disparate_impacts': disparate_impacts,
'majority_group': majority_group
}
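The disparate-impact check used by comprehensive_bias_audit below applies the "four-fifths rule", a conventional screening threshold (notably used by the US EEOC): a selection-rate ratio under 0.8 between a protected group and the majority group is flagged as potential adverse impact. The numbers here are hypothetical:

```python
# Hypothetical selection rates illustrating the four-fifths (80%) rule
majority_rate, minority_rate = 0.50, 0.35
ratio = minority_rate / majority_rate          # 0.70
print("PASS" if ratio >= 0.8 else "FAIL")      # FAIL: below the 0.8 screen
```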
def detect_equalized_odds_bias(self, y_true, y_pred, protected_attribute):
"""Detect equalized odds violations"""
results = {}
for group in protected_attribute.unique():
group_mask = protected_attribute == group
# True Positive Rate (Sensitivity)
group_y_true = y_true[group_mask]
group_y_pred = y_pred[group_mask]
if (group_y_true == 1).sum() > 0:
tpr = ((group_y_true == 1) & (group_y_pred == 1)).sum() / (group_y_true == 1).sum()
else:
tpr = 0
# False Positive Rate
if (group_y_true == 0).sum() > 0:
fpr = ((group_y_true == 0) & (group_y_pred == 1)).sum() / (group_y_true == 0).sum()
else:
fpr = 0
results[group] = {'tpr': tpr, 'fpr': fpr}
return results
def detect_predictive_parity_bias(self, y_true, y_pred, protected_attribute):
"""Detect predictive parity violations (equal PPV across groups)"""
results = {}
for group in protected_attribute.unique():
group_mask = protected_attribute == group
group_y_true = y_true[group_mask]
group_y_pred = y_pred[group_mask]
# Positive Predictive Value (Precision)
if (group_y_pred == 1).sum() > 0:
ppv = ((group_y_true == 1) & (group_y_pred == 1)).sum() / (group_y_pred == 1).sum()
else:
ppv = 0
# Negative Predictive Value
if (group_y_pred == 0).sum() > 0:
npv = ((group_y_true == 0) & (group_y_pred == 0)).sum() / (group_y_pred == 0).sum()
else:
npv = 0
results[group] = {'ppv': ppv, 'npv': npv}
return results
def comprehensive_bias_audit(self, model, X, y, protected_attributes):
"""Perform comprehensive bias audit"""
print("COMPREHENSIVE BIAS AUDIT")
print("=" * 50)
# Make predictions
y_pred = model.predict(X)
y_pred_proba = model.predict_proba(X)[:, 1] if hasattr(model, 'predict_proba') else None
audit_results = {}
for attr_name, attr_values in protected_attributes.items():
print(f"\nAnalyzing bias for: {attr_name}")
print("-" * 30)
# Statistical Parity
stat_parity = self.detect_statistical_parity_bias(y, y_pred, attr_values)
print("Statistical Parity:")
for group, rate in stat_parity['positive_rates'].items():
print(f" {group}: {rate:.3f} positive rate")
print("Disparate Impact Ratios:")
for group, impact in stat_parity['disparate_impacts'].items():
status = "PASS" if impact >= 0.8 else "FAIL"
print(f" {group}: {impact:.3f} ({status})")
# Equalized Odds
eq_odds = self.detect_equalized_odds_bias(y, y_pred, attr_values)
print("\nEqualized Odds:")
for group, metrics in eq_odds.items():
print(f" {group}: TPR={metrics['tpr']:.3f}, FPR={metrics['fpr']:.3f}")
# Predictive Parity
pred_parity = self.detect_predictive_parity_bias(y, y_pred, attr_values)
print("\nPredictive Parity:")
for group, metrics in pred_parity.items():
print(f" {group}: PPV={metrics['ppv']:.3f}, NPV={metrics['npv']:.3f}")
audit_results[attr_name] = {
'statistical_parity': stat_parity,
'equalized_odds': eq_odds,
'predictive_parity': pred_parity
}
return audit_results
def visualize_bias_analysis(self, df, protected_attr, outcome, predictions=None):
"""Create visualizations for bias analysis"""
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Outcome distribution by protected attribute
outcome_by_group = df.groupby(protected_attr)[outcome].agg(['mean', 'count'])
axes[0,0].bar(outcome_by_group.index, outcome_by_group['mean'])
axes[0,0].set_title(f'{outcome} Rate by {protected_attr}')
axes[0,0].set_ylabel('Positive Rate')
axes[0,0].tick_params(axis='x', rotation=45)
# 2. Sample size by group
axes[0,1].bar(outcome_by_group.index, outcome_by_group['count'])
axes[0,1].set_title(f'Sample Size by {protected_attr}')
axes[0,1].set_ylabel('Count')
axes[0,1].tick_params(axis='x', rotation=45)
# 3. Feature correlation with protected attribute
if 'salary' in df.columns:
df.boxplot(column='salary', by=protected_attr, ax=axes[0,2])
axes[0,2].set_title(f'Salary Distribution by {protected_attr}')
axes[0,2].set_ylabel('Salary')
# 4. Prediction accuracy by group (if predictions provided)
if predictions is not None:
accuracy_by_group = {}
for group in df[protected_attr].unique():
group_mask = df[protected_attr] == group
accuracy = (df.loc[group_mask, outcome] == predictions[group_mask]).mean()
accuracy_by_group[group] = accuracy
axes[1,0].bar(accuracy_by_group.keys(), accuracy_by_group.values())
axes[1,0].set_title(f'Prediction Accuracy by {protected_attr}')
axes[1,0].set_ylabel('Accuracy')
axes[1,0].tick_params(axis='x', rotation=45)
# 5. False positive/negative rates by group
if predictions is not None:
fp_rates = {}
fn_rates = {}
for group in df[protected_attr].unique():
group_mask = df[protected_attr] == group
group_true = df.loc[group_mask, outcome]
group_pred = predictions[group_mask]
# False Positive Rate
if (group_true == 0).sum() > 0:
fp_rate = ((group_true == 0) & (group_pred == 1)).sum() / (group_true == 0).sum()
else:
fp_rate = 0
# False Negative Rate
if (group_true == 1).sum() > 0:
fn_rate = ((group_true == 1) & (group_pred == 0)).sum() / (group_true == 1).sum()
else:
fn_rate = 0
fp_rates[group] = fp_rate
fn_rates[group] = fn_rate
x_pos = np.arange(len(fp_rates))
width = 0.35
axes[1,1].bar(x_pos - width/2, list(fp_rates.values()), width, label='False Positive Rate')
axes[1,1].bar(x_pos + width/2, list(fn_rates.values()), width, label='False Negative Rate')
axes[1,1].set_title(f'Error Rates by {protected_attr}')
axes[1,1].set_xticks(x_pos)
axes[1,1].set_xticklabels(fp_rates.keys(), rotation=45)
axes[1,1].legend()
# 6. Feature importance visualization (if model available)
axes[1,2].text(0.5, 0.5, 'Feature Importance\n(Requires Model)',
ha='center', va='center', transform=axes[1,2].transAxes)
axes[1,2].set_title('Feature Importance Analysis')
plt.tight_layout()
return fig
class FairnessAwareML:
"""Implementation of fairness-aware machine learning techniques"""
def __init__(self, fairness_constraint='statistical_parity'):
self.fairness_constraint = fairness_constraint
self.preprocessors = {}
self.postprocessors = {}
def fair_preprocessing_reweighting(self, X, y, protected_attribute):
"""
Preprocessing: Reweight training samples to achieve fairness
Based on Kamiran & Calders (2012)
"""
weights = np.ones(len(X))
# Calculate weights for each group
for group in protected_attribute.unique():
group_mask = protected_attribute == group
# Positive and negative class sizes in group
pos_in_group = ((y == 1) & group_mask).sum()
neg_in_group = ((y == 0) & group_mask).sum()
# Overall positive and negative class sizes
total_pos = (y == 1).sum()
total_neg = (y == 0).sum()
total_samples = len(y)
group_size = group_mask.sum()
# Expected number if perfectly balanced
expected_pos_in_group = (total_pos / total_samples) * group_size
expected_neg_in_group = (total_neg / total_samples) * group_size
# Calculate weights
if pos_in_group > 0:
pos_weight = expected_pos_in_group / pos_in_group
weights[(y == 1) & group_mask] = pos_weight
if neg_in_group > 0:
neg_weight = expected_neg_in_group / neg_in_group
weights[(y == 0) & group_mask] = neg_weight
return weights
def fair_postprocessing_threshold_optimization(self, y_true, y_pred_proba,
protected_attribute,
fairness_constraint='equalized_odds'):
"""
Postprocessing: Optimize thresholds per group to satisfy fairness constraints
"""
thresholds = {}
if fairness_constraint == 'statistical_parity':
# Find thresholds that equalize positive prediction rates
target_rate = y_pred_proba.mean() # Overall positive rate
for group in protected_attribute.unique():
group_mask = protected_attribute == group
group_proba = y_pred_proba[group_mask]
# Find threshold that achieves target rate
sorted_proba = np.sort(group_proba)
target_idx = int(len(sorted_proba) * (1 - target_rate))
thresholds[group] = sorted_proba[target_idx] if target_idx < len(sorted_proba) else 1.0
elif fairness_constraint == 'equalized_odds':
# Optimize thresholds to equalize TPR and FPR across groups
from scipy.optimize import minimize_scalar
def objective(threshold, group_data):
group_pred = (group_data['proba'] >= threshold).astype(int)
tpr = ((group_data['true'] == 1) & (group_pred == 1)).sum() / max(1, (group_data['true'] == 1).sum())
fpr = ((group_data['true'] == 0) & (group_pred == 1)).sum() / max(1, (group_data['true'] == 0).sum())
return abs(tpr - 0.8) + abs(fpr - 0.2) # Target TPR=0.8, FPR=0.2
for group in protected_attribute.unique():
group_mask = protected_attribute == group
group_data = {
'true': y_true[group_mask],
'proba': y_pred_proba[group_mask]
}
result = minimize_scalar(objective, args=(group_data,), bounds=(0, 1), method='bounded')
thresholds[group] = result.x
return thresholds
def apply_fair_thresholds(self, y_pred_proba, protected_attribute, thresholds):
"""Apply group-specific thresholds"""
y_pred_fair = np.zeros_like(y_pred_proba, dtype=int)
for group, threshold in thresholds.items():
group_mask = protected_attribute == group
y_pred_fair[group_mask] = (y_pred_proba[group_mask] >= threshold).astype(int)
return y_pred_fair
def adversarial_debiasing(self, X_train, y_train, protected_train,
X_test, y_test, protected_test,
lambda_fairness=1.0):
"""
Adversarial debiasing approach
Simplified implementation - in practice would use deep learning frameworks
"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
print("Training adversarial debiasing model...")
# Main classifier
main_classifier = LogisticRegression(random_state=42)
# Adversarial classifier (tries to predict protected attribute from predictions)
adversarial_classifier = LogisticRegression(random_state=42)
# Iterative training (simplified)
for iteration in range(5):
# Train main classifier
main_classifier.fit(X_train, y_train)
# Get predictions for adversarial training
main_predictions = main_classifier.predict_proba(X_train)[:, 1].reshape(-1, 1)
# Train adversarial classifier
le = LabelEncoder()
protected_encoded = le.fit_transform(protected_train)
adversarial_classifier.fit(main_predictions, protected_encoded)
# Calculate adversarial loss (simplified)
adv_predictions = adversarial_classifier.predict(main_predictions)
adv_accuracy = accuracy_score(protected_encoded, adv_predictions)
print(f"Iteration {iteration + 1}: Adversarial accuracy = {adv_accuracy:.3f}")
# In real implementation, would update main classifier weights
# to minimize main loss + lambda_fairness * adversarial_accuracy
# Evaluate on test set
test_predictions = main_classifier.predict(X_test)
test_proba = main_classifier.predict_proba(X_test)[:, 1]
return main_classifier, test_predictions, test_proba
def demonstrate_fairness_techniques():
"""Demonstrate various fairness techniques"""
print("FAIRNESS-AWARE ML DEMONSTRATION")
print("=" * 50)
# Create bias detection framework
bias_detector = BiasDetectionFramework()
# Generate biased dataset
df = bias_detector.generate_biased_dataset(n_samples=5000)
print("Generated dataset with embedded bias patterns")
print(f"Dataset shape: {df.shape}")
# Prepare data for modeling
# Encode categorical variables
le_gender = LabelEncoder()
le_race = LabelEncoder()
le_education = LabelEncoder()
df['gender_encoded'] = le_gender.fit_transform(df['gender'])
df['race_encoded'] = le_race.fit_transform(df['race'])
df['education_encoded'] = le_education.fit_transform(df['education_level'])
# Features (excluding protected attributes for "fair" model)
feature_columns = ['age', 'education_encoded', 'experience_years', 'salary']
X = df[feature_columns]
y = df['high_performer']
# Protected attributes
protected_attrs = {
'gender': df['gender'],
'race': df['race']
}
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)
# Get corresponding protected attributes for splits
train_idx = X_train.index
test_idx = X_test.index
protected_train = {attr: values.loc[train_idx] for attr, values in protected_attrs.items()}
protected_test = {attr: values.loc[test_idx] for attr, values in protected_attrs.items()}
# 1. Train standard model (potentially biased)
print("\n1. STANDARD MODEL (Potentially Biased)")
print("-" * 40)
standard_model = RandomForestClassifier(n_estimators=100, random_state=42)
standard_model.fit(X_train, y_train)
# Perform bias audit
bias_audit = bias_detector.comprehensive_bias_audit(
standard_model, X_test, y_test, protected_test
)
# 2. Fair preprocessing
print("\n2. FAIR PREPROCESSING (Reweighting)")
print("-" * 40)
fairness_ml = FairnessAwareML()
# Calculate sample weights
sample_weights = fairness_ml.fair_preprocessing_reweighting(
X_train, y_train, protected_train['gender']
)
# Train model with reweighted samples
fair_preprocessed_model = RandomForestClassifier(n_estimators=100, random_state=42)
fair_preprocessed_model.fit(X_train, y_train, sample_weight=sample_weights)
# Audit fair preprocessed model
fair_preprocessed_audit = bias_detector.comprehensive_bias_audit(
fair_preprocessed_model, X_test, y_test, protected_test
)
# 3. Fair postprocessing
print("\n3. FAIR POSTPROCESSING (Threshold Optimization)")
print("-" * 40)
# Get probabilities from standard model
y_pred_proba = standard_model.predict_proba(X_test)[:, 1]
# Optimize thresholds for fairness
fair_thresholds = fairness_ml.fair_postprocessing_threshold_optimization(
y_test, y_pred_proba, protected_test['gender'],
fairness_constraint='statistical_parity'
)
print(f"Fair thresholds: {fair_thresholds}")
# Apply fair thresholds
y_pred_fair = fairness_ml.apply_fair_thresholds(
y_pred_proba, protected_test['gender'], fair_thresholds
)
# Create dummy model for audit (using fair predictions)
class FairThresholdModel:
def __init__(self, base_model, thresholds, protected_attr):
self.base_model = base_model
self.thresholds = thresholds
self.protected_attr = protected_attr
def predict(self, X):
proba = self.base_model.predict_proba(X)[:, 1]
return fairness_ml.apply_fair_thresholds(proba, self.protected_attr, self.thresholds)
def predict_proba(self, X):
return self.base_model.predict_proba(X)
fair_postprocessed_model = FairThresholdModel(
standard_model, fair_thresholds, protected_test['gender']
)
# Audit fair postprocessed model
fair_postprocessed_audit = bias_detector.comprehensive_bias_audit(
fair_postprocessed_model, X_test, y_test, protected_test
)
return {
'dataset': df,
'standard_audit': bias_audit,
'fair_preprocessed_audit': fair_preprocessed_audit,
'fair_postprocessed_audit': fair_postprocessed_audit,
'models': {
'standard': standard_model,
'fair_preprocessed': fair_preprocessed_model,
'fair_postprocessed': fair_postprocessed_model
}
}

class ExplainableAI:
"""Comprehensive framework for model interpretability and explainability"""
def __init__(self):
self.explanations = {}
self.global_explanations = {}
def feature_importance_analysis(self, model, X, y, feature_names=None):
"""Comprehensive feature importance analysis"""
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
importance_results = {}
# 1. Model-specific feature importance
if hasattr(model, 'feature_importances_'):
importance_results['model_specific'] = {
'importance': model.feature_importances_,
'features': feature_names
}
# 2. Permutation importance
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(
model, X, y, n_repeats=10, random_state=42, n_jobs=-1
)
importance_results['permutation'] = {
'importance_mean': perm_importance.importances_mean,
'importance_std': perm_importance.importances_std,
'features': feature_names
}
return importance_results
def shap_explanations(self, model, X_train, X_test, feature_names=None):
"""Generate SHAP explanations for model predictions"""
try:
import shap
except ImportError:
print("SHAP not available. Install with: pip install shap")
return None
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]
# Choose appropriate explainer
if hasattr(model, 'tree_'):
# Tree-based models
explainer = shap.TreeExplainer(model)
else:
# Model-agnostic explainer
explainer = shap.Explainer(model, X_train)
# Generate SHAP values
shap_values = explainer.shap_values(X_test)
# Handle multi-class case
if isinstance(shap_values, list):
shap_values = shap_values[1] # Use positive class for binary classification
explanation_results = {
'shap_values': shap_values,
'expected_value': explainer.expected_value,
'feature_names': feature_names,
'explainer': explainer
}
return explanation_results
def lime_explanations(self, model, X_train, X_test, instance_idx=0,
feature_names=None, mode='classification'):
"""Generate LIME explanations for individual predictions"""
try:
from lime import lime_tabular
except ImportError:
print("LIME not available. Install with: pip install lime")
return None
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]
# Create LIME explainer
explainer = lime_tabular.LimeTabularExplainer(
X_train.values if hasattr(X_train, 'values') else X_train,
feature_names=feature_names,
mode=mode,
random_state=42
)
# Generate explanation for specific instance
if mode == 'classification':
explanation = explainer.explain_instance(
X_test.iloc[instance_idx].values if hasattr(X_test, 'iloc') else X_test[instance_idx],
model.predict_proba,
num_features=len(feature_names)
)
else:
explanation = explainer.explain_instance(
X_test.iloc[instance_idx].values if hasattr(X_test, 'iloc') else X_test[instance_idx],
model.predict,
num_features=len(feature_names)
)
return explanation
def partial_dependence_analysis(self, model, X, features, feature_names=None):
"""Generate partial dependence plots"""
from sklearn.inspection import partial_dependence
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
pd_results = {}
for feature_idx in features:
if isinstance(feature_idx, int):
feature_name = feature_names[feature_idx]
# Recent scikit-learn returns a Bunch with 'average' and 'grid_values'
pd_data = partial_dependence(model, X, [feature_idx])
pd_results[feature_name] = {
'partial_dependence': pd_data['average'][0],
'values': pd_data['grid_values'][0]
}
return pd_results
def counterfactual_explanations(self, model, X_train, instance,
desired_class=1, max_iterations=1000):
"""
Generate counterfactual explanations
Simplified implementation - in practice, use specialized libraries like DiCE
"""
from sklearn.neighbors import NearestNeighbors
# Find similar instances with desired outcome
current_prediction = model.predict([instance])[0]
if current_prediction == desired_class:
return {"message": "Instance already has desired class"}
# Get predictions for all training instances
train_predictions = model.predict(X_train)
desired_instances = X_train[train_predictions == desired_class]
if len(desired_instances) == 0:
return {"message": "No instances found with desired class"}
# Find nearest neighbor with desired class
nn = NearestNeighbors(n_neighbors=1)
nn.fit(desired_instances)
distances, indices = nn.kneighbors([instance])
nearest_counterfactual = desired_instances.iloc[indices[0][0]]
# Calculate feature changes needed
feature_changes = {}
for i, feature_name in enumerate(X_train.columns):
original_value = instance[i]
counterfactual_value = nearest_counterfactual.iloc[i]
if abs(original_value - counterfactual_value) > 1e-6:
feature_changes[feature_name] = {
'original': original_value,
'counterfactual': counterfactual_value,
'change': counterfactual_value - original_value
}
return {
'counterfactual_instance': nearest_counterfactual,
'feature_changes': feature_changes,
'distance': distances[0][0]
}
def model_behavior_analysis(self, model, X, y, feature_names=None):
"""Analyze overall model behavior patterns"""
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
analysis = {}
# 1. Prediction distribution
y_pred = model.predict(X)
y_pred_proba = model.predict_proba(X)[:, 1] if hasattr(model, 'predict_proba') else None
analysis['prediction_distribution'] = {
'class_distribution': pd.Series(y_pred).value_counts().to_dict(),
'confidence_stats': {
'mean_confidence': y_pred_proba.mean() if y_pred_proba is not None else None,
'std_confidence': y_pred_proba.std() if y_pred_proba is not None else None
}
}
# 2. Feature utilization
if hasattr(model, 'feature_importances_'):
feature_utilization = dict(zip(feature_names, model.feature_importances_))
analysis['feature_utilization'] = feature_utilization
# 3. Decision boundary analysis (for 2D case)
if X.shape[1] == 2:
analysis['decision_boundary'] = self._analyze_decision_boundary(model, X, y)
return analysis
def _analyze_decision_boundary(self, model, X, y):
"""Analyze decision boundary characteristics"""
# Create mesh for decision boundary
h = 0.02 # Step size
x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Make predictions on mesh
mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict_proba(mesh_points)[:, 1]
Z = Z.reshape(xx.shape)
return {
'mesh_x': xx,
'mesh_y': yy,
'decision_surface': Z
}
def generate_explanation_report(self, model, X_train, X_test, y_test,
feature_names=None, instance_idx=0):
"""Generate comprehensive explanation report"""
print("GENERATING EXPLANATION REPORT")
print("=" * 50)
if feature_names is None:
feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]
report = {}
# 1. Global explanations
print("1. Analyzing global model behavior...")
# Feature importance
importance_results = self.feature_importance_analysis(model, X_test, y_test, feature_names)
report['feature_importance'] = importance_results
# Model behavior
behavior_analysis = self.model_behavior_analysis(model, X_test, y_test, feature_names)
report['model_behavior'] = behavior_analysis
# 2. Local explanations
print("2. Generating local explanations...")
# SHAP explanations
shap_results = self.shap_explanations(model, X_train, X_test[:100], feature_names)
if shap_results:
report['shap_explanations'] = shap_results
# LIME explanation for specific instance
lime_explanation = self.lime_explanations(
model, X_train, X_test, instance_idx, feature_names
)
if lime_explanation:
report['lime_explanation'] = lime_explanation
# 3. Counterfactual explanations
print("3. Generating counterfactual explanations...")
instance = X_test.iloc[instance_idx].values
counterfactual = self.counterfactual_explanations(
model, X_train, instance, desired_class=1
)
report['counterfactual'] = counterfactual
print("Explanation report generated successfully!")
return report
def visualize_explanations(self, explanation_report, figsize=(20, 15)):
"""Create comprehensive visualization of explanations"""
fig, axes = plt.subplots(3, 3, figsize=figsize)
# 1. Feature importance comparison
if 'feature_importance' in explanation_report:
importance_data = explanation_report['feature_importance']
if 'model_specific' in importance_data:
model_importance = importance_data['model_specific']
axes[0,0].barh(model_importance['features'], model_importance['importance'])
axes[0,0].set_title('Model-Specific Feature Importance')
if 'permutation' in importance_data:
perm_importance = importance_data['permutation']
axes[0,1].barh(perm_importance['features'], perm_importance['importance_mean'])
axes[0,1].set_title('Permutation Feature Importance')
# 2. SHAP summary plot
if 'shap_explanations' in explanation_report:
try:
import shap
shap_data = explanation_report['shap_explanations']
# SHAP summary plot (simplified)
mean_abs_shap = np.mean(np.abs(shap_data['shap_values']), axis=0)
axes[0,2].barh(shap_data['feature_names'], mean_abs_shap)
axes[0,2].set_title('Mean |SHAP| Values')
# SHAP waterfall plot for first instance (simplified)
if len(shap_data['shap_values']) > 0:
instance_shap = shap_data['shap_values'][0]
axes[1,0].barh(shap_data['feature_names'], instance_shap)
axes[1,0].set_title('SHAP Values - Instance 0')
except ImportError:
axes[0,2].text(0.5, 0.5, 'SHAP not available',
ha='center', va='center', transform=axes[0,2].transAxes)
# 3. Model behavior analysis
if 'model_behavior' in explanation_report:
behavior = explanation_report['model_behavior']
# Prediction distribution
if 'prediction_distribution' in behavior:
pred_dist = behavior['prediction_distribution']['class_distribution']
axes[1,1].bar(list(pred_dist.keys()), list(pred_dist.values()))
axes[1,1].set_title('Prediction Distribution')
# 4. Counterfactual analysis
if 'counterfactual' in explanation_report:
cf_data = explanation_report['counterfactual']
if 'feature_changes' in cf_data:
changes = cf_data['feature_changes']
features = list(changes.keys())
change_values = [changes[f]['change'] for f in features]
axes[1,2].barh(features, change_values)
axes[1,2].set_title('Counterfactual Feature Changes')
# Fill remaining subplots with placeholder text
for i in range(2, 3):
for j in range(3):
if i == 2:
axes[i,j].text(0.5, 0.5, f'Additional Analysis\nSlot {i},{j}',
ha='center', va='center', transform=axes[i,j].transAxes)
axes[i,j].set_title(f'Analysis Slot {i},{j}')
plt.tight_layout()
return fig
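Before wiring SHAP through the class above, it can help to see the library in isolation. Below is a minimal standalone sketch; it assumes `shap` is installed, the synthetic dataset and model sizes are illustrative, and the `isinstance` check covers older SHAP releases that return one array per class for binary classifiers.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data and model
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer is the fast path for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:20])

# Older SHAP versions return a list with one array per class
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Mean absolute SHAP value per feature gives a global importance ranking
print(np.abs(shap_values).mean(axis=0))
```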
def demonstrate_explainable_ai():
"""Demonstrate explainable AI techniques"""
print("EXPLAINABLE AI DEMONSTRATION")
print("=" * 50)
# Generate sample data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_informative=7,
n_redundant=2, n_clusters_per_class=1,
class_sep=0.8, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=feature_names)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2,
random_state=42, stratify=y)
# Train model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Create explainer
explainer = ExplainableAI()
# Generate comprehensive explanation report
explanation_report = explainer.generate_explanation_report(
model, X_train, X_test, y_test, feature_names, instance_idx=5
)
# Create visualizations
fig = explainer.visualize_explanations(explanation_report)
return explanation_report, fig

class ProductionDeploymentFramework:
"""Framework for safe machine learning model deployment"""
def __init__(self, model_name, version="1.0"):
self.model_name = model_name
self.version = version
self.deployment_config = {}
self.monitoring_config = {}
def pre_deployment_checklist(self, model, X_test, y_test, business_requirements):
"""Comprehensive pre-deployment validation checklist"""
checklist_results = {
'performance_validation': False,
'bias_audit_passed': False,
'interpretability_check': False,
'robustness_test': False,
'business_requirements_met': False,
'technical_requirements_met': False,
'security_audit_passed': False,
'documentation_complete': False
}
# 1. Performance Validation
print("1. PERFORMANCE VALIDATION")
print("-" * 30)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
performance_metrics = {
'accuracy': accuracy_score(y_test, y_pred),
'precision': precision_score(y_test, y_pred, average='weighted'),
'recall': recall_score(y_test, y_pred, average='weighted'),
'f1_score': f1_score(y_test, y_pred, average='weighted')
}
# Check against business requirements
min_performance = business_requirements.get('min_performance', {})
performance_passed = all(
performance_metrics.get(metric, 0) >= threshold
for metric, threshold in min_performance.items()
)
checklist_results['performance_validation'] = performance_passed
print(f"Performance validation: {'PASS' if performance_passed else 'FAIL'}")
for metric, value in performance_metrics.items():
min_req = min_performance.get(metric, 'N/A')
status = 'PASS' if min_req == 'N/A' or value >= min_req else 'FAIL'  # no stated minimum counts as a pass
print(f" {metric}: {value:.4f} (min: {min_req}) - {status}")
# 2. Bias Audit
print(f"\n2. BIAS AUDIT")
print("-" * 30)
# Simplified bias check - in practice, use comprehensive framework
# Check if model predictions are relatively balanced
prediction_balance = abs(y_pred.mean() - 0.5)
bias_threshold = 0.3 # Allow up to 30% imbalance
bias_passed = prediction_balance <= bias_threshold
checklist_results['bias_audit_passed'] = bias_passed
print(f"Bias audit: {'PASS' if bias_passed else 'FAIL'}")
print(f" Prediction balance: {prediction_balance:.3f} (threshold: {bias_threshold})")
# 3. Robustness Testing
print(f"\n3. ROBUSTNESS TESTING")
print("-" * 30)
robustness_results = self._test_model_robustness(model, X_test, y_test)
robustness_passed = robustness_results['adversarial_robustness'] > 0.8
checklist_results['robustness_test'] = robustness_passed
print(f"Robustness test: {'PASS' if robustness_passed else 'FAIL'}")
print(f" Adversarial robustness: {robustness_results['adversarial_robustness']:.3f}")
# 4. Technical Requirements
print(f"\n4. TECHNICAL REQUIREMENTS")
print("-" * 30)
technical_checks = self._validate_technical_requirements(model, business_requirements)
checklist_results['technical_requirements_met'] = technical_checks['all_passed']
print(f"Technical requirements: {'PASS' if technical_checks['all_passed'] else 'FAIL'}")
for check, result in technical_checks.items():
if check != 'all_passed':
print(f" {check}: {'PASS' if result else 'FAIL'}")
# Summary
print(f"\n5. DEPLOYMENT READINESS SUMMARY")
print("-" * 40)
total_checks = len(checklist_results)
passed_checks = sum(checklist_results.values())
readiness_score = passed_checks / total_checks
print(f"Readiness score: {readiness_score:.1%} ({passed_checks}/{total_checks})")
if readiness_score >= 0.8:
deployment_recommendation = "APPROVED for deployment"
elif readiness_score >= 0.6:
deployment_recommendation = "CONDITIONAL approval - address failed checks"
else:
deployment_recommendation = "NOT APPROVED - significant issues found"
print(f"Recommendation: {deployment_recommendation}")
return {
'checklist_results': checklist_results,
'readiness_score': readiness_score,
'recommendation': deployment_recommendation,
'performance_metrics': performance_metrics
}
def _test_model_robustness(self, model, X_test, y_test, noise_levels=[0.01, 0.05, 0.1]):
"""Test model robustness to input perturbations"""
original_accuracy = model.score(X_test, y_test)
robustness_scores = []
for noise_level in noise_levels:
# Add Gaussian noise
X_noisy = X_test + np.random.normal(0, noise_level, X_test.shape)
noisy_accuracy = model.score(X_noisy, y_test)
robustness_score = noisy_accuracy / original_accuracy
robustness_scores.append(robustness_score)
return {
'adversarial_robustness': np.mean(robustness_scores),
'robustness_by_noise_level': dict(zip(noise_levels, robustness_scores))
}
def _validate_technical_requirements(self, model, requirements):
"""Validate technical deployment requirements"""
checks = {}
# Memory usage check
model_size_mb = self._estimate_model_size(model)
max_size_mb = requirements.get('max_model_size_mb', 100)
checks['memory_usage'] = model_size_mb <= max_size_mb
# Prediction latency check
latency_ms = self._measure_prediction_latency(model)
max_latency_ms = requirements.get('max_latency_ms', 100)
checks['latency'] = latency_ms <= max_latency_ms
# Feature requirements check
required_features = requirements.get('required_features', [])
if hasattr(model, 'feature_names_in_'):
model_features = set(model.feature_names_in_)
checks['feature_availability'] = set(required_features).issubset(model_features)
else:
checks['feature_availability'] = True # Cannot verify
# Thread safety check (simplified)
checks['thread_safety'] = True # Most sklearn models are thread-safe for prediction
checks['all_passed'] = all(checks.values())
return checks
def _estimate_model_size(self, model):
"""Estimate model size in MB"""
import pickle
model_bytes = len(pickle.dumps(model))
return model_bytes / (1024 * 1024)
def _measure_prediction_latency(self, model, n_samples=1000):
"""Measure average prediction latency"""
import time
# Create sample data
if hasattr(model, 'n_features_in_'):
n_features = model.n_features_in_
else:
n_features = 10 # Default
X_sample = np.random.randn(n_samples, n_features)
# Measure prediction time
start_time = time.time()
_ = model.predict(X_sample)
end_time = time.time()
total_time_ms = (end_time - start_time) * 1000
avg_latency_ms = total_time_ms / n_samples
return avg_latency_ms
def canary_deployment_strategy(self, old_model, new_model, X_test, y_test,
traffic_split=0.1, success_threshold=0.95):
"""
Implement canary deployment strategy
"""
print("CANARY DEPLOYMENT STRATEGY")
print("=" * 40)
n_samples = len(X_test)
canary_size = int(n_samples * traffic_split)
# Split test data
canary_indices = np.random.choice(n_samples, canary_size, replace=False)
canary_mask = np.zeros(n_samples, dtype=bool)
canary_mask[canary_indices] = True
X_canary = X_test[canary_mask]
y_canary = y_test[canary_mask]
X_control = X_test[~canary_mask]
y_control = y_test[~canary_mask]
print(f"Canary group size: {len(X_canary)} ({traffic_split:.1%})")
print(f"Control group size: {len(X_control)} ({1-traffic_split:.1%})")
# Evaluate both models
old_model_performance = old_model.score(X_control, y_control)
new_model_performance = new_model.score(X_canary, y_canary)
print(f"Old model (control) accuracy: {old_model_performance:.4f}")
print(f"New model (canary) accuracy: {new_model_performance:.4f}")
# Decision logic
relative_performance = new_model_performance / old_model_performance
if relative_performance >= success_threshold:
decision = "PROCEED with full deployment"
status = "SUCCESS"
else:
decision = "ROLLBACK - performance degradation detected"
status = "FAILURE"
print(f"Relative performance: {relative_performance:.4f}")
print(f"Decision: {decision}")
return {
'status': status,
'old_model_performance': old_model_performance,
'new_model_performance': new_model_performance,
'relative_performance': relative_performance,
'decision': decision
}
def blue_green_deployment_strategy(self, blue_model, green_model, X_test, y_test,
performance_threshold=0.02):
"""
Implement blue-green deployment strategy
"""
print("BLUE-GREEN DEPLOYMENT STRATEGY")
print("=" * 40)
# Evaluate both environments
blue_performance = blue_model.score(X_test, y_test)
green_performance = green_model.score(X_test, y_test)
print(f"Blue environment performance: {blue_performance:.4f}")
print(f"Green environment performance: {green_performance:.4f}")
performance_diff = green_performance - blue_performance
if performance_diff >= -performance_threshold: # Allow small degradation
decision = "SWITCH to green environment"
active_model = green_model
status = "SUCCESS"
else:
decision = "KEEP blue environment active"
active_model = blue_model
status = "ROLLBACK"
print(f"Performance difference: {performance_diff:+.4f}")
print(f"Decision: {decision}")
return {
'status': status,
'active_model': active_model,
'blue_performance': blue_performance,
'green_performance': green_performance,
'performance_diff': performance_diff,
'decision': decision
}
def a_b_testing_framework(self, model_a, model_b, X_test, y_test,
confidence_level=0.95, min_sample_size=100):
"""
Implement A/B testing framework for model comparison
"""
print("A/B TESTING FRAMEWORK")
print("=" * 30)
from scipy.stats import ttest_ind
# Randomly split traffic
n_samples = len(X_test)
if n_samples < 2 * min_sample_size:
return {"error": f"Insufficient samples. Need at least {2 * min_sample_size}"}
# Random assignment
assignment = np.random.choice(['A', 'B'], n_samples)
X_a = X_test[assignment == 'A']
y_a = y_test[assignment == 'A']
X_b = X_test[assignment == 'B']
y_b = y_test[assignment == 'B']
# Calculate individual prediction accuracies
pred_a = model_a.predict(X_a)
pred_b = model_b.predict(X_b)
accuracy_a = (pred_a == y_a).astype(float)
accuracy_b = (pred_b == y_b).astype(float)
# Statistical test
t_stat, p_value = ttest_ind(accuracy_a, accuracy_b)
alpha = 1 - confidence_level
is_significant = p_value < alpha
print(f"Model A performance: {accuracy_a.mean():.4f} ± {accuracy_a.std():.4f} (n={len(accuracy_a)})")
print(f"Model B performance: {accuracy_b.mean():.4f} ± {accuracy_b.std():.4f} (n={len(accuracy_b)})")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant difference: {is_significant} (α={alpha})")
if is_significant:
winner = 'A' if accuracy_a.mean() > accuracy_b.mean() else 'B'
recommendation = f"Deploy Model {winner}"
else:
recommendation = "No significant difference - choose based on other criteria"
print(f"Recommendation: {recommendation}")
return {
'model_a_performance': accuracy_a.mean(),
'model_b_performance': accuracy_b.mean(),
't_statistic': t_stat,
'p_value': p_value,
'is_significant': is_significant,
'recommendation': recommendation,
'sample_sizes': {'A': len(accuracy_a), 'B': len(accuracy_b)}
}
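A practical companion to the A/B test above is a power analysis: how many samples does each arm need before the t-test can reliably detect the difference you care about? A hedged sketch using `statsmodels` (assumed installed); the baseline accuracy and minimum detectable lift are illustrative numbers.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative goal: detect a lift from 85% to 87% accuracy
baseline, target = 0.85, 0.87
effect_size = proportion_effectsize(target, baseline)

# Solve for the per-arm sample size at standard alpha and power
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative='two-sided'
)
print(f"Samples needed per arm: {int(round(n_per_arm))}")
```

If the required sample size exceeds your available test traffic, either lengthen the test window or accept a larger minimum detectable effect.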
class ModelGovernanceFramework:
"""Comprehensive model governance and compliance framework"""
def __init__(self):
self.governance_policies = {}
self.compliance_requirements = {}
self.audit_trail = []
def define_governance_policies(self):
"""Define comprehensive governance policies"""
policies = {
"model_development": {
"code_review_required": True,
"documentation_requirements": [
"Model architecture description",
"Training data description",
"Performance evaluation report",
"Bias and fairness analysis",
"Business impact assessment"
],
"testing_requirements": [
"Unit tests for preprocessing",
"Integration tests for pipeline",
"Performance benchmarks",
"Robustness tests",
"Bias audits"
]
},
"data_governance": {
"data_quality_checks": True,
"privacy_compliance": True,
"data_lineage_tracking": True,
"consent_management": True,
"retention_policies": {
"training_data": "3 years",
"prediction_logs": "1 year",
"model_artifacts": "5 years"
}
},
"deployment_governance": {
"staging_approval_required": True,
"production_approval_required": True,
"rollback_procedures": True,
"monitoring_requirements": [
"Performance monitoring",
"Data drift detection",
"Bias monitoring",
"Business metrics tracking"
]
},
"operational_governance": {
"incident_response_procedures": True,
"regular_model_reviews": "quarterly",
"retraining_triggers": [
"Performance degradation > 5%",
"Data drift detected",
"Bias threshold exceeded",
"Business requirements change"
],
"access_controls": {
"model_artifacts": "role_based",
"training_data": "restricted",
"prediction_logs": "audited_access"
}
}
}
self.governance_policies = policies
return policies
def regulatory_compliance_framework(self):
"""Framework for regulatory compliance (GDPR, CCPA, AI Act, etc.)"""
compliance_framework = {
"GDPR": {
"requirements": [
"Right to explanation for automated decisions",
"Data minimization principle",
"Consent for data processing",
"Right to be forgotten",
"Data protection by design"
],
"implementation": {
"explainable_ai": "Required for high-impact decisions",
"data_anonymization": "PII must be anonymized/pseudonymized",
"consent_tracking": "User consent must be tracked and auditable",
"deletion_procedures": "Ability to delete user data on request",
"privacy_by_design": "Privacy considerations in model design"
}
},
"CCPA": {
"requirements": [
"Right to know what data is collected",
"Right to delete personal information",
"Right to opt-out of sale",
"Non-discrimination for exercising rights"
],
"implementation": {
"data_inventory": "Catalog all personal data used in models",
"deletion_capabilities": "Technical ability to delete user data",
"opt_out_mechanisms": "Allow users to opt-out of data processing",
"non_discrimination": "Equal service regardless of privacy choices"
}
},
"AI_Act_EU": {
"risk_categories": {
"prohibited": ["Social scoring", "Subliminal techniques"],
"high_risk": ["Employment decisions", "Credit scoring", "Healthcare"],
"limited_risk": ["Chatbots", "Deepfakes"],
"minimal_risk": ["AI-enabled games", "Spam filters"]
},
"high_risk_requirements": [
"Risk management system",
"High-quality training data",
"Logging and traceability",
"Transparency and user information",
"Human oversight",
"Accuracy and robustness"
]
},
"Algorithmic_Accountability": {
"fairness_requirements": [
"Regular bias audits",
"Disparate impact analysis",
"Equalized odds assessment",
"Demographic parity checks"
],
"transparency_requirements": [
"Model documentation",
"Decision logic explanation",
"Performance metrics disclosure",
"Limitation acknowledgment"
]
}
}
self.compliance_requirements = compliance_framework
return compliance_framework
def create_model_card(self, model_info, performance_metrics, bias_analysis,
intended_use, limitations):
"""Create standardized model card for documentation"""
model_card = {
"model_details": {
"name": model_info.get('name', 'Unnamed Model'),
"version": model_info.get('version', '1.0'),
"date": model_info.get('date', pd.Timestamp.now().strftime('%Y-%m-%d')),
"type": model_info.get('type', 'Classification'),
"architecture": model_info.get('architecture', 'Unknown'),
"developers": model_info.get('developers', []),
"contact": model_info.get('contact', '')
},
"intended_use": {
"primary_use_cases": intended_use.get('primary_use_cases', []),
"out_of_scope_uses": intended_use.get('out_of_scope_uses', []),
"target_users": intended_use.get('target_users', []),
"deployment_context": intended_use.get('deployment_context', '')
},
"performance": {
"metrics": performance_metrics,
"test_data_description": model_info.get('test_data_description', ''),
"evaluation_procedure": model_info.get('evaluation_procedure', '')
},
"bias_analysis": {
"protected_attributes": bias_analysis.get('protected_attributes', []),
"fairness_metrics": bias_analysis.get('fairness_metrics', {}),
"bias_mitigation": bias_analysis.get('bias_mitigation', []),
"limitations": bias_analysis.get('limitations', [])
},
"training_data": {
"description": model_info.get('training_data_description', ''),
"size": model_info.get('training_data_size', ''),
"preprocessing": model_info.get('preprocessing_steps', []),
"known_biases": model_info.get('known_biases', [])
},
"limitations_and_risks": {
"known_limitations": limitations.get('known_limitations', []),
"potential_risks": limitations.get('potential_risks', []),
"mitigation_strategies": limitations.get('mitigation_strategies', []),
"monitoring_plan": limitations.get('monitoring_plan', [])
},
"recommendations": {
"usage_guidelines": model_info.get('usage_guidelines', []),
"monitoring_requirements": model_info.get('monitoring_requirements', []),
"update_schedule": model_info.get('update_schedule', ''),
"feedback_mechanisms": model_info.get('feedback_mechanisms', [])
}
}
return model_card
def audit_trail_management(self, action, user, model_id, details=None):
"""Manage audit trail for model governance"""
audit_entry = {
'timestamp': pd.Timestamp.now(),
'action': action,
'user': user,
'model_id': model_id,
'details': details or {},
'audit_id': len(self.audit_trail) + 1
}
self.audit_trail.append(audit_entry)
# Log critical actions
critical_actions = ['deploy', 'rollback', 'data_access', 'model_update']
if action in critical_actions:
print(f"AUDIT LOG: {action} by {user} on model {model_id} at {audit_entry['timestamp']}")
return audit_entry['audit_id']
def compliance_assessment(self, model_info, deployment_context):
"""Assess compliance with regulatory requirements"""
assessment_results = {}
# Determine applicable regulations
applicable_regulations = []
if deployment_context.get('geographic_scope') in ['EU', 'European Union']:
applicable_regulations.extend(['GDPR', 'AI_Act_EU'])
if deployment_context.get('geographic_scope') in ['CA', 'California', 'US']:
applicable_regulations.append('CCPA')
if deployment_context.get('use_case') in ['hiring', 'lending', 'healthcare']:
applicable_regulations.append('Algorithmic_Accountability')
# Assess compliance for each regulation
for regulation in applicable_regulations:
if regulation in self.compliance_requirements:
compliance_check = self._assess_regulation_compliance(
regulation, model_info, deployment_context
)
assessment_results[regulation] = compliance_check
# Overall compliance score
total_checks = sum(len(result['checks']) for result in assessment_results.values())
passed_checks = sum(
sum(result['checks'].values()) for result in assessment_results.values()
)
overall_score = passed_checks / total_checks if total_checks > 0 else 0
return {
'applicable_regulations': applicable_regulations,
'assessment_results': assessment_results,
'overall_compliance_score': overall_score,
'recommendations': self._generate_compliance_recommendations(assessment_results)
}
def _assess_regulation_compliance(self, regulation, model_info, deployment_context):
"""Assess compliance with specific regulation"""
checks = {}
if regulation == 'GDPR':
checks['explainable_ai'] = model_info.get('explainable', False)
checks['data_minimization'] = model_info.get('data_minimized', False)
checks['consent_tracking'] = model_info.get('consent_managed', False)
checks['deletion_capability'] = model_info.get('deletion_supported', False)
elif regulation == 'AI_Act_EU':
risk_level = self._determine_ai_act_risk_level(deployment_context)
if risk_level == 'high_risk':
checks['risk_management'] = model_info.get('risk_management_system', False)
checks['quality_training_data'] = model_info.get('high_quality_data', False)
checks['logging_traceability'] = model_info.get('logging_enabled', False)
checks['human_oversight'] = model_info.get('human_oversight', False)
elif regulation == 'Algorithmic_Accountability':
checks['bias_audit'] = model_info.get('bias_audited', False)
checks['fairness_metrics'] = model_info.get('fairness_assessed', False)
checks['transparency'] = model_info.get('transparent_documentation', False)
compliance_score = sum(checks.values()) / len(checks) if checks else 1.0
return {
'regulation': regulation,
'checks': checks,
'compliance_score': compliance_score,
'status': 'COMPLIANT' if compliance_score >= 0.8 else 'NON_COMPLIANT'
}
def _determine_ai_act_risk_level(self, deployment_context):
"""Determine AI Act risk level based on deployment context"""
use_case = deployment_context.get('use_case', '').lower()
high_risk_cases = ['employment', 'hiring', 'credit', 'lending', 'healthcare',
'education', 'law_enforcement']
if any(case in use_case for case in high_risk_cases):
return 'high_risk'
return 'limited_risk'
def _generate_compliance_recommendations(self, assessment_results):
"""Generate recommendations based on compliance assessment"""
recommendations = []
for regulation, result in assessment_results.items():
if result['compliance_score'] < 0.8:
failed_checks = [check for check, passed in result['checks'].items() if not passed]
for check in failed_checks:
if check == 'explainable_ai':
recommendations.append("Implement explainable AI techniques (SHAP, LIME)")
elif check == 'bias_audit':
recommendations.append("Conduct comprehensive bias audit across protected attributes")
elif check == 'human_oversight':
recommendations.append("Implement human-in-the-loop oversight for high-risk decisions")
elif check == 'logging_traceability':
recommendations.append("Enable comprehensive logging and audit trails")
return recommendations
def demonstrate_governance_framework():
"""Demonstrate model governance framework"""
print("MODEL GOVERNANCE FRAMEWORK DEMONSTRATION")
print("=" * 50)
# Initialize governance framework
governance = ModelGovernanceFramework()
# Define policies
policies = governance.define_governance_policies()
print("1. Governance policies defined")
# Define compliance framework
compliance_framework = governance.regulatory_compliance_framework()
print("2. Compliance framework established")
# Example model info for model card creation
model_info = {
'name': 'Credit Risk Model',
'version': '2.1',
'type': 'Binary Classification',
'architecture': 'Random Forest',
'developers': ['Data Science Team'],
'contact': ''
}
performance_metrics = {
'accuracy': 0.87,
'precision': 0.82,
'recall': 0.79,
'f1_score': 0.80,
'auc': 0.89
}
bias_analysis = {
'protected_attributes': ['gender', 'race', 'age'],
'fairness_metrics': {'demographic_parity': 0.85, 'equalized_odds': 0.83}
}
intended_use = {
'primary_use_cases': ['Credit risk assessment', 'Loan approval decisions'],
'out_of_scope_uses': ['Employment decisions', 'Insurance pricing']
}
limitations = {
'known_limitations': ['Limited to certain geographic regions', 'Requires recent financial data'],
'potential_risks': ['May impact underrepresented groups', 'Performance degradation over time']
}
# Create model card
model_card = governance.create_model_card(
model_info, performance_metrics, bias_analysis, intended_use, limitations
)
print("3. Model card created")
# Compliance assessment
deployment_context = {
'geographic_scope': 'EU',
'use_case': 'lending'
}
compliance_assessment = governance.compliance_assessment(model_info, deployment_context)
print("4. Compliance assessment completed")
print(f" Overall compliance score: {compliance_assessment['overall_compliance_score']:.2%}")
print(f" Applicable regulations: {compliance_assessment['applicable_regulations']}")
# Audit trail example
governance.audit_trail_management('deploy', 'user123', 'credit_model_v2.1',
{'environment': 'production', 'approval_id': 'APP001'})
print("5. Audit trail updated")
return {
'governance_policies': policies,
'compliance_framework': compliance_framework,
'model_card': model_card,
'compliance_assessment': compliance_assessment
}

if __name__ == "__main__":
    deployment_result = demonstrate_governance_framework()
    print("=== Model Governance Framework Demonstration Complete ===")
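A model card only supports governance if it is persisted and versioned alongside the model itself. Here is a small hedged sketch that serializes the card returned by `create_model_card` to JSON; the file name and the timestamp handling are illustrative choices.

```python
import json
import pandas as pd

def save_model_card(model_card, path="model_card.json"):
    """Persist a model card dict as JSON, stringifying any pandas
    Timestamps (audit entries, for example) along the way."""
    def _default(obj):
        if isinstance(obj, pd.Timestamp):
            return obj.isoformat()
        raise TypeError(f"Not JSON serializable: {type(obj)}")

    with open(path, "w") as f:
        json.dump(model_card, f, indent=2, default=_default)

# Illustrative usage with the demonstration above:
# results = demonstrate_governance_framework()
# save_model_card(results['model_card'])
```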
\newpage
## 10.4 Practical Labs
### Lab 10.1: Comprehensive Bias Detection and Mitigation
**Objective**: Detect and mitigate bias in a real-world dataset
```python
def comprehensive_bias_lab():
    """Complete lab for bias detection and mitigation"""
    print("=== Lab 10.1: Comprehensive Bias Detection and Mitigation ===\n")

    # Initialize bias detection framework
    bias_framework = BiasDetectionFramework()

    # Generate biased dataset
    dataset = bias_framework.generate_biased_dataset(n_samples=5000)
    print("1. Dataset Generated")
    print(f"   Shape: {dataset['X'].shape}")
    print(f"   Protected attributes: {dataset['protected_attrs'].columns.tolist()}")

    # Detect bias
    bias_results = bias_framework.detect_bias(
        dataset['X'], dataset['y'], dataset['protected_attrs']
    )
    print("\n2. Bias Detection Results:")
    for attr, metrics in bias_results.items():
        print(f"   {attr}:")
        print(f"     Demographic Parity: {metrics['demographic_parity']:.3f}")
        print(f"     Equalized Odds: {metrics['equalized_odds']:.3f}")

    # Train biased model (split protected attributes together with X and y
    # so rows stay aligned after shuffling)
    X_train, X_test, y_train, y_test, protected_train, protected_test = train_test_split(
        dataset['X'], dataset['y'], dataset['protected_attrs'],
        test_size=0.3, random_state=42
    )
    biased_model = RandomForestClassifier(random_state=42)
    biased_model.fit(X_train, y_train)

    # Apply fairness-aware learning
    fairness_framework = FairnessAwareML()

    # Reweighting approach
    fair_weights = fairness_framework.demographic_parity_reweighting(
        X_train, y_train, protected_train
    )
    fair_model = RandomForestClassifier(random_state=42)
    fair_model.fit(X_train, y_train, sample_weight=fair_weights)

    # Compare models
    biased_pred = biased_model.predict(X_test)
    fair_pred = fair_model.predict(X_test)
    print("\n3. Model Comparison:")
    print("   Biased Model:")
    print(f"     Accuracy: {accuracy_score(y_test, biased_pred):.3f}")
    print("   Fair Model:")
    print(f"     Accuracy: {accuracy_score(y_test, fair_pred):.3f}")

    # Fairness evaluation on the held-out split
    fair_bias_results = bias_framework.detect_bias(
        X_test, fair_pred, protected_test
    )
    print("\n4. Fairness Improvement:")
    for attr in bias_results.keys():
        old_dp = bias_results[attr]['demographic_parity']
        new_dp = fair_bias_results[attr]['demographic_parity']
        improvement = ((1 - new_dp) - (1 - old_dp)) / (1 - old_dp) * 100
        print(f"   {attr} Demographic Parity improvement: {improvement:.1f}%")

# Run the lab
comprehensive_bias_lab()
```
### Lab 10.2: Model Interpretability and Explanation
**Objective**: Implement and compare multiple interpretability techniques

```python
def interpretability_lab():
    """Complete lab for model interpretability techniques"""
    print("=== Lab 10.2: Model Interpretability and Explanation ===\n")

    # Initialize interpretability framework
    interpretability = ExplainableAI()

    # Load and prepare data (as a DataFrame so the .iloc-based helpers work)
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target
    feature_names = list(data.feature_names)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("1. Model trained on breast cancer dataset")
    print(f"   Accuracy: {model.score(X_test, y_test):.3f}")

    # Global interpretability - feature importance
    importance_results = interpretability.feature_importance_analysis(
        model, X_train, y_train, feature_names
    )
    model_importance = importance_results['model_specific']
    ranked = sorted(zip(model_importance['features'], model_importance['importance']),
                    key=lambda pair: pair[1], reverse=True)
    print("\n2. Global Feature Importance (Top 5):")
    for i, (feature, importance) in enumerate(ranked[:5]):
        print(f"   {i+1}. {feature}: {importance:.3f}")

    # Local interpretability - LIME
    sample_idx = 0
    lime_explanation = interpretability.lime_explanations(
        model, X_train, X_test, sample_idx, feature_names
    )
    print(f"\n3. LIME Explanation for sample {sample_idx}:")
    print(f"   Prediction: {'Malignant' if model.predict(X_test[sample_idx:sample_idx+1])[0] == 0 else 'Benign'}")
    print(f"   Confidence: {max(model.predict_proba(X_test[sample_idx:sample_idx+1])[0]):.3f}")

    # SHAP values
    shap_values = interpretability.shap_explanations(model, X_train, X_test[:5], feature_names)
    print("\n4. SHAP Analysis completed for 5 test samples")

    # Counterfactual explanation
    counterfactual = interpretability.counterfactual_explanations(
        model, X_train, X_test.iloc[sample_idx].values, desired_class=1
    )
    print("\n5. Counterfactual Explanation:")
    print("   To change prediction, modify:")
    for feature, change in counterfactual.get('feature_changes', {}).items():
        if abs(change['change']) > 0.1:  # Only show significant changes
            print(f"   - {feature}: {change['change']:+.2f}")

# Run the lab
interpretability_lab()
```

### Lab 10.3: Production Deployment Pipeline
**Objective**: Build a complete ML deployment pipeline with monitoring

```python
def deployment_pipeline_lab():
    """Complete lab for production deployment pipeline"""
    print("=== Lab 10.3: Production Deployment Pipeline ===\n")

    # Initialize deployment framework
    deployment = ProductionDeployment()

    # Create model artifacts
    model_artifacts = deployment.create_model_artifacts()
    print("1. Model artifacts created")

    # Setup monitoring
    monitoring = deployment.setup_monitoring()
    print("2. Monitoring dashboard configured")

    # Simulate production deployment
    deployment_config = {
        'environment': 'staging',
        'replicas': 2,
        'resource_limits': {'cpu': '500m', 'memory': '1Gi'},
        'auto_scaling': True,
        'health_checks': True
    }
    deployment_status = deployment.deploy_model(
        model_artifacts['model_path'], deployment_config
    )
    print("3. Model deployed to staging environment")
    print(f"   Status: {deployment_status['status']}")
    print(f"   Endpoint: {deployment_status['endpoint']}")

    # A/B testing setup
    ab_test_config = {
        'control_model': 'model_v1.0',
        'treatment_model': 'model_v1.1',
        'traffic_split': 0.2,
        'success_metric': 'conversion_rate',
        'minimum_sample_size': 1000
    }
    ab_test = deployment.setup_ab_testing(ab_test_config)
    print("4. A/B testing configured")

    # Model governance
    governance = ModelGovernance()

    # Create governance policies
    policies = governance.create_governance_policies()
    print("5. Governance policies established")

    # Generate model card
    model_info = {
        'name': 'Customer Churn Predictor',
        'version': '1.1.0',
        'type': 'Binary Classification',
        'intended_use': 'Identify customers at risk of churning'
    }
    model_card = governance.create_model_card(
        model_info,
        {'accuracy': 0.87, 'precision': 0.82, 'recall': 0.79},
        {'demographic_parity': 0.85},
        {'primary_use_cases': ['Customer retention']},
        {'known_limitations': ['Limited to subscription customers']}
    )
    print("6. Model card generated")

    return {
        'deployment_status': deployment_status,
        'monitoring_config': monitoring,
        'ab_test_config': ab_test,
        'governance_framework': policies,
        'model_card': model_card
    }

# Run the lab
deployment_result = deployment_pipeline_lab()
print("\n=== Deployment Pipeline Lab Complete ===")
```

```python
class EthicalMLChecklist:
"""Comprehensive checklist for ethical ML development"""
def __init__(self):
self.checklist_items = {
'data_ethics': [
'Data collection consent obtained',
'Privacy-preserving techniques applied',
'Data minimization principles followed',
'Bias in data sources identified and documented',
'Data provenance and lineage tracked'
],
'model_development': [
'Fairness metrics defined and measured',
'Multiple bias detection methods applied',
'Model interpretability requirements met',
'Robust evaluation across subgroups performed',
'Failure modes identified and documented'
],
'deployment_ethics': [
'Impact assessment completed',
'Stakeholder feedback incorporated',
'Monitoring systems for bias established',
'Human oversight mechanisms in place',
'Rollback procedures defined'
],
'governance': [
'Model card created and maintained',
'Audit trails established',
'Compliance requirements verified',
'Regular bias audits scheduled',
'Ethical review board approval obtained'
]
}
def evaluate_project(self, project_details):
"""Evaluate project against ethical standards"""
evaluation_results = {}
for category, items in self.checklist_items.items():
category_score = 0
category_details = []
for item in items:
# Simplified evaluation logic
is_compliant = project_details.get(item.lower().replace(' ', '_'), False)
category_score += 1 if is_compliant else 0
category_details.append({
'item': item,
'compliant': is_compliant
})
evaluation_results[category] = {
'score': category_score / len(items),
'details': category_details
}
overall_score = sum(result['score'] for result in evaluation_results.values()) / len(evaluation_results)
return {
'overall_ethical_score': overall_score,
'category_scores': evaluation_results,
'recommendations': self._generate_recommendations(evaluation_results)
}
def _generate_recommendations(self, evaluation_results):
"""Generate recommendations based on evaluation"""
recommendations = []
for category, result in evaluation_results.items():
if result['score'] < 0.8: # Below 80% compliance
non_compliant_items = [
detail['item'] for detail in result['details']
if not detail['compliant']
]
recommendations.append({
'category': category,
'priority': 'High' if result['score'] < 0.5 else 'Medium',
'action_items': non_compliant_items
})
return recommendations
# Example usage
def demonstrate_ethical_checklist():
"""Demonstrate ethical ML checklist evaluation"""
print("=== Ethical ML Development Checklist ===\n")
checklist = EthicalMLChecklist()
# Example project evaluation
project_details = {
'data_collection_consent_obtained': True,
'privacy-preserving_techniques_applied': True,
'data_minimization_principles_followed': False,
'bias_in_data_sources_identified_and_documented': True,
'fairness_metrics_defined_and_measured': True,
'model_interpretability_requirements_met': False,
'impact_assessment_completed': True,
'human_oversight_mechanisms_in_place': True,
'model_card_created_and_maintained': False,
'audit_trails_established': True
}
evaluation = checklist.evaluate_project(project_details)
print(f"Overall Ethical Score: {evaluation['overall_ethical_score']:.1%}")
print("\nCategory Scores:")
for category, result in evaluation['category_scores'].items():
print(f" {category.replace('_', ' ').title()}: {result['score']:.1%}")
print("\nRecommendations:")
for rec in evaluation['recommendations']:
print(f" {rec['category'].replace('_', ' ').title()} ({rec['priority']} Priority):")
for action in rec['action_items']:
print(f" - {action}")
return evaluation
# Run demonstration
ethical_evaluation = demonstrate_ethical_checklist()Difficulty: Intermediate
Given a hiring dataset, identify potential sources of bias and implement detection methods.
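As a hedged warm-up for the fairness-metric step (Task 3) before tackling the template below, the selection-rate comparison can be sketched in a few lines; the column names and the tiny frame are illustrative stand-ins for your dataset.

```python
import pandas as pd

# Illustrative stand-in for a hiring dataset; replace with your own data
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'hired':  [1,   0,   1,   1,   1,   0,   0,   0],
})

# Selection rate per group and the demographic-parity gap between groups
rates = df.groupby('gender')['hired'].mean()
print(rates)
print(f"Demographic parity difference: {rates.max() - rates.min():.3f}")
```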
```python
# Exercise template
def hiring_bias_analysis():
    """
    TODO: Implement bias detection for hiring dataset

    Tasks:
    1. Load hiring dataset with protected attributes
    2. Identify potential bias sources
    3. Calculate fairness metrics
    4. Recommend mitigation strategies
    5. Implement and evaluate one mitigation method

    Expected output:
    - Bias analysis report
    - Fairness metrics before/after mitigation
    - Recommendations for improvement
    """
    pass

# Your implementation here
```

### Exercise 2: Medical Diagnosis Explainer
**Difficulty**: Advanced
Build an explainable AI system for a medical diagnosis model.
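As a hedged warm-up for the LIME portion (Task 2) ahead of the template below, a single-prediction explanation can look like this; the dataset and model choices are illustrative, and `lime` is assumed installed.

```python
from lime import lime_tabular
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

explainer = lime_tabular.LimeTabularExplainer(
    X_train, feature_names=list(data.feature_names),
    class_names=list(data.target_names), mode='classification'
)

# Explain one prediction: top five feature contributions for this patient
explanation = explainer.explain_instance(X_test[0], model.predict_proba,
                                         num_features=5)
print(explanation.as_list())
```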
```python
def medical_diagnosis_explainer():
    """
    TODO: Create explainable AI for medical diagnosis

    Tasks:
    1. Train a medical diagnosis model
    2. Implement LIME and SHAP explanations
    3. Create feature importance rankings
    4. Generate counterfactual explanations
    5. Build visualization dashboard

    Expected output:
    - Model with multiple explanation methods
    - Comparative analysis of explanation techniques
    - Interactive visualization of explanations
    """
    pass

# Your implementation here
```

### Exercise 3: Complete Deployment Pipeline
**Difficulty**: Advanced
Design and implement a complete ML deployment pipeline.
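As a hedged warm-up for the artifact-packaging step ahead of the template below, persisting and smoke-testing a trained model is the first building block of any pipeline; the file name and the tiny latency check are illustrative.

```python
import time
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# 1. Package the artifact
joblib.dump(model, "model_v1.joblib")

# 2. Reload it the way a serving process would, and sanity-check it
served_model = joblib.load("model_v1.joblib")
assert (served_model.predict(X[:10]) == model.predict(X[:10])).all()

# 3. Basic per-sample latency measurement
start = time.perf_counter()
served_model.predict(X)
latency_ms = (time.perf_counter() - start) / len(X) * 1000
print(f"Average latency: {latency_ms:.4f} ms per sample")
```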
```python
def complete_deployment_pipeline():
    """
    TODO: Build end-to-end deployment pipeline

    Tasks:
    1. Create model training pipeline
    2. Implement automated testing
    3. Build deployment automation
    4. Setup monitoring and alerting
    5. Implement A/B testing framework
    6. Create governance documentation

    Expected output:
    - Complete deployment infrastructure
    - Monitoring dashboard
    - A/B testing results
    - Governance compliance report
    """
    pass

# Your implementation here
```

### Exercise 4: Ethical AI Framework
**Difficulty**: Expert
Develop a comprehensive ethical AI framework for your organization.
```python
def ethical_ai_framework():
    """
    TODO: Create organizational ethical AI framework

    Tasks:
    1. Define ethical principles and guidelines
    2. Create bias detection and mitigation protocols
    3. Establish governance and oversight processes
    4. Design compliance monitoring systems
    5. Create training and education materials
    6. Implement framework validation procedures

    Expected output:
    - Complete ethical AI framework document
    - Implementation guidelines
    - Compliance monitoring tools
    - Training materials
    """
    pass

# Your implementation here
```

## 10.6 Summary

In this chapter, we explored the critical aspects of ethics and deployment in machine learning.
The concepts and techniques learned in this chapter are essential for:

- **Responsible AI Development**: Building ethical and fair ML systems
- **Regulatory Compliance**: Meeting legal and industry requirements
- **Production ML Systems**: Deploying reliable and monitored ML models
- **Stakeholder Trust**: Creating transparent and accountable AI systems
- **Risk Management**: Mitigating bias and ensuring safe AI deployment
As we reach the end of this transformative journey together, pause for a moment and reflect on the incredible transformation you’ve undergone. You began this book as a curious learner, perhaps intimidated by the mathematical complexity and overwhelmed by the possibilities. You now stand as a Guardian of Algorithmic Wisdom, equipped not just with technical skills, but with the ethical compass to use them responsibly.
The field of AI ethics is still being written, and you are now one of its authors. As you venture forth, carry these profound questions with you:
🤔 The Consciousness Question: As AI systems become more sophisticated, how will we recognize and respect emergent forms of machine intelligence?
🌍 The Global Equity Challenge: How can we ensure that AI’s benefits reach every corner of humanity, not just the technologically privileged?
🔮 The Singularity Paradox: How do we maintain human agency and meaning in a world where machines surpass human cognitive abilities?
⚖️ The Governance Puzzle: What new forms of democratic participation and oversight will emerge to govern AI systems that affect billions?
🧬 The Human Enhancement Dilemma: Where do we draw the line between using AI to augment human capabilities and fundamentally altering what it means to be human?
You are not just a practitioner of machine learning; you are a co-author of humanity’s next chapter. The algorithms you build, the biases you eliminate, the fairness you embed, and the transparency you provide will ripple through time, affecting generations yet unborn.
Your formal education in machine learning may be complete, but your real education is just beginning. The field evolves so rapidly that the cutting-edge technique of today becomes tomorrow’s foundation. Embrace this constant evolution as the source of endless wonder and opportunity.
Remember that you don’t walk this path alone. You’re joining a global community of AI practitioners who share your commitment to building technology that serves humanity’s highest aspirations. Seek out mentors, find colleagues who challenge your thinking, and always be ready to mentor the next generation of guardians.
As you close this book and open your code editor, ask yourself: What kind of future do you want to help create? Your answer to this question will guide every algorithmic decision, every model architecture choice, and every deployment strategy you make.
The tools are in your hands. The theory lives in your mind. The wisdom rests in your heart.
Now go forth and build the future we all deserve to inherit.
“The best time to plant a tree was 20 years ago. The second best time is now. The best time to build ethical AI was at the dawn of the field. The second best time is right now.” - Ancient Proverb, adapted for the AI age
The adventure continues…
This chapter concludes our comprehensive journey through machine learning theory and practice. The ethical considerations and deployment practices covered here are crucial for responsible AI development and will serve as the foundation for your professional machine learning career.
End of Chapter 10
This chapter has equipped you with the essential knowledge and practical skills needed to develop, deploy, and maintain ethical, fair, and responsible machine learning systems in production environments.

\newpage
# Appendix A: Python Environment Setup
This appendix provides comprehensive instructions for setting up a Python environment for machine learning development. We’ll cover multiple approaches to ensure you can work with the examples and exercises throughout this textbook.
**Anaconda** is a complete Python distribution that includes:

- Python interpreter
- 250+ pre-installed packages for data science
- Conda package manager
- Anaconda Navigator GUI
- Jupyter Notebook, Spyder IDE, and other tools

**Miniconda** is a minimal installer that includes:

- Python interpreter
- Conda package manager
- Basic packages only

**Recommendation**: Use Miniconda for more control over your environment, or Anaconda for convenience.
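If you choose Miniconda, the installation mirrors the Anaconda steps below; a sketch for Linux (verify the current installer file name on anaconda.com before running):

```bash
# Download and run the latest Miniconda installer (Linux x86_64)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"

# Initialize conda for your shell
"$HOME/miniconda3/bin/conda" init bash
```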
```bash
# macOS: using Homebrew (recommended)
brew install --cask anaconda

# Or download from the website and install manually:
# download the .pkg file from anaconda.com and run the installer
```

```bash
# Linux: download the installer
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
# Make it executable and run
chmod +x Anaconda3-2023.09-0-Linux-x86_64.sh
./Anaconda3-2023.09-0-Linux-x86_64.sh
# Follow the prompts and add conda to PATH
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

```bash
# Check conda installation
conda --version
# Check Python installation
python --version
# List installed packages
conda list
```

Virtual environments provide:

- **Isolation**: Separate package installations for different projects
- **Reproducibility**: Consistent environments across machines
- **Dependency Management**: Avoid version conflicts
- **Clean System**: Keep base Python installation clean
```bash
# Create a new environment
conda create --name ml-textbook python=3.9
# Activate the environment
conda activate ml-textbook
# Deactivate when done
conda deactivate
```

```bash
# Create environment with essential packages
conda create --name ml-textbook python=3.9 \
numpy pandas matplotlib seaborn \
scikit-learn jupyter notebook
# Create from environment file
conda env create -f environment.yml
```

A sample `environment.yml`:

```yaml
name: ml-textbook
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - numpy=1.24.0
  - pandas=2.0.0
  - matplotlib=3.7.0
  - seaborn=0.12.0
  - scikit-learn=1.3.0
  - jupyter=1.0.0
  - notebook=6.5.0
  - ipykernel=6.22.0
  - pip=23.0.0
  - pip:
      - plotly==5.14.0
      - shap==0.41.0
      - lime==0.2.0.1
```

```bash
# List all environments
conda env list
conda info --envs
# Clone an environment
conda create --name ml-textbook-copy --clone ml-textbook
# Export environment
conda env export > environment.yml
# Remove environment
conda env remove --name ml-textbook
# Update environment from file
conda env update --name ml-textbook --file environment.yml
```

```bash
# Create virtual environment
python -m venv ml-textbook-env
# Activate (Windows)
ml-textbook-env\Scripts\activate
# Activate (macOS/Linux)
source ml-textbook-env/bin/activate
# Install packages
pip install numpy pandas matplotlib scikit-learn jupyter
# Create requirements file
pip freeze > requirements.txt
# Install from requirements
pip install -r requirements.txt
# Deactivate
deactivate
```

```bash
# Install Jupyter (if not already installed)
conda install jupyter notebook
# Or with pip
pip install jupyter notebook
# Install JupyterLab (modern interface)
conda install jupyterlab
# Or with pip
pip install jupyterlab
```

```bash
# Generate config file
jupyter notebook --generate-config
# Config file location
# Windows: C:\Users\username\.jupyter\jupyter_notebook_config.py
# macOS/Linux: ~/.jupyter/jupyter_notebook_config.py
```

```python
# ~/.jupyter/jupyter_notebook_config.py
# Set default directory
c.NotebookApp.notebook_dir = '/path/to/your/projects'
# Enable extensions
c.NotebookApp.nbserver_extensions = {
'jupyter_nbextensions_configurator': True,
'nbgrader.server_extensions.formgrader': True,
}
# Security settings
c.NotebookApp.token = '' # Disable token for local use (less secure)
c.NotebookApp.password = '' # Or set password
# Browser settings
c.NotebookApp.open_browser = True
c.NotebookApp.port = 8888
# Auto-save interval (in seconds)
c.FileContentsManager.autosave_interval = 60
```

```bash
# Install nbextensions
conda install -c conda-forge jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
# Install configurator
conda install -c conda-forge jupyter_nbextensions_configurator
jupyter nbextensions_configurator enable --user
# Popular extensions to enable:
# - Variable Inspector
# - Code Folding
# - Table of Contents (2)
# - Autopep8
# - ExecuteTime
```

```bash
# Add environment as Jupyter kernel
conda activate ml-textbook
python -m ipykernel install --user --name ml-textbook --display-name "ML Textbook"
# List available kernels
jupyter kernelspec list
# Remove kernel
jupyter kernelspec uninstall ml-textbook
# Data manipulation and analysis
conda install numpy pandas
# Visualization
conda install matplotlib seaborn plotly
# Machine learning
conda install scikit-learn
# Deep learning (optional)
conda install tensorflow pytorch
# Statistical analysis
conda install scipy statsmodels
# Jupyter ecosystem
conda install jupyter notebook jupyterlab ipywidgets
# Development tools
conda install autopep8 black flake8
# Additional ML tools
pip install shap lime xgboost lightgbm catboost
Create requirements.txt:
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0
scikit-learn>=1.1.0
jupyter>=1.0.0
notebook>=6.4.0
ipykernel>=6.0.0
plotly>=5.0.0
scipy>=1.7.0
statsmodels>=0.13.0
# Interpretability
shap>=0.40.0
lime>=0.2.0
# Gradient boosting
xgboost>=1.6.0
lightgbm>=3.3.0
catboost>=1.1.0
# Additional utilities
tqdm>=4.60.0
joblib>=1.1.0
pillow>=8.3.0
Create setup_environment.py:
#!/usr/bin/env python3
"""
Setup script for ML Textbook environment
"""
import subprocess
import sys
import os
def run_command(command):
"""Run shell command and handle errors"""
try:
result = subprocess.run(command, shell=True, check=True,
capture_output=True, text=True)
print(f"✅ {command}")
return result.stdout
except subprocess.CalledProcessError as e:
print(f"✅ {command}")
print(f"Error: {e.stderr}")
return None
def setup_conda_environment():
"""Setup conda environment for ML textbook"""
print("Setting up ML Textbook Environment...")
# Check if conda is available
if run_command("conda --version") is None:
print("Conda not found. Please install Anaconda or Miniconda first.")
return False
# Create environment
env_command = """
conda create --name ml-textbook python=3.9 -y &&
conda activate ml-textbook &&
conda install -c conda-forge numpy pandas matplotlib seaborn scikit-learn jupyter notebook ipykernel -y &&
python -m ipykernel install --user --name ml-textbook --display-name "ML Textbook" &&
pip install shap lime xgboost lightgbm plotly tqdm
"""
if run_command(env_command):
print("\n✅ Environment setup complete!")
print("To activate: conda activate ml-textbook")
print("To start Jupyter: jupyter notebook")
return True
else:
print("✅ Environment setup failed!")
return False
def verify_installation():
"""Verify that all packages are properly installed"""
print("\nVerifying installation...")
required_packages = [
'numpy', 'pandas', 'matplotlib', 'seaborn',
'sklearn', 'jupyter', 'shap', 'lime'
]
for package in required_packages:
try:
__import__(package)
print(f"✅ {package}")
except ImportError:
print(f"✅ {package} - Not installed")
if __name__ == "__main__":
if len(sys.argv) > 1 and sys.argv[1] == "verify":
verify_installation()
else:
setup_conda_environment()
# Add conda to PATH (Linux/macOS)
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Windows: Add to system PATH through Environment Variables
# C:\Users\username\Anaconda3\Scripts
# Update conda
conda update conda
# Clear package cache
conda clean --all
# Use different channels
conda install -c conda-forge package_name
# Use pip as fallback
pip install package_name
# Linux/macOS: Use --user flag
pip install --user package_name
# Windows: Run as Administrator or use --user flag
# Check if running
jupyter notebook list
# Kill existing processes
pkill -f jupyter
# Restart with specific port
jupyter notebook --port=8889
# Reset configuration
jupyter notebook --generate-config
# Refresh kernel list
jupyter kernelspec list
# Install kernel for current environment
python -m ipykernel install --user --name $(basename $CONDA_DEFAULT_ENV)
# Fix kernel connection
pip install --upgrade jupyter jupyter-client
# Check Python path
import sys
print(sys.path)
# Check installed packages
import pkg_resources
installed_packages = [d.project_name for d in pkg_resources.working_set]
print(sorted(installed_packages))
# Check package versions
conda list package_name
pip show package_name
# Update specific package
conda update package_name
pip install --upgrade package_name
# Force reinstall
pip install --force-reinstall package_name
# Reinitialize conda
conda init
# Check environment path
conda info --envs
# Recreate environment
conda env remove --name ml-textbook
conda create --name ml-textbook python=3.9
# Install VS Code extensions
code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter
code --install-extension ms-python.flake8
code --install-extension ms-python.black-formatter
# Install formatting tools
pip install black autopep8 flake8 isort
# Format code
black your_script.py
autopep8 --in-place --aggressive your_script.py
# Check style
flake8 your_script.py
# Sort imports
isort your_script.py
# Configure Git for Jupyter notebooks
pip install nbstripout
# Remove output from notebooks before committing
nbstripout --install
# Configure Git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
# Create .gitignore for Python projects
echo "*.pyc
__pycache__/
.ipynb_checkpoints/
.env
.venv/
*.egg-info/
build/
dist/" > .gitignore# Monitor memory usage
import psutil
import os
def memory_usage():
process = psutil.Process(os.getpid())
return process.memory_info().rss / 1024 / 1024 # MB
# Optimize pandas memory usage
import numpy as np
import pandas as pd
def optimize_dataframe(df):
"""Optimize DataFrame memory usage"""
for col in df.columns:
col_type = df[col].dtype
if col_type != 'object':
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
else:
# Downcast floats to float32 when the value range allows it
if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
return df
# Configure joblib for scikit-learn
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Use all available cores
clf = RandomForestClassifier(n_jobs=-1)
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}  # example grid
grid_search = GridSearchCV(clf, param_grid, n_jobs=-1)
# Configure number of threads
import os
os.environ['OMP_NUM_THREADS'] = '4'
os.environ['MKL_NUM_THREADS'] = '4'  # set thread limits before importing numpy for full effect
# Environment management
conda create --name myenv python=3.9
conda activate myenv
conda deactivate
conda env list
conda env remove --name myenv
# Package management
conda install package_name
pip install package_name
conda update package_name
pip install --upgrade package_name
# Jupyter
jupyter notebook
jupyter lab
jupyter kernelspec list
jupyter nbextension enable --py widgetsnbextension
# Standard imports for ML projects
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
# Jupyter notebook settings
%matplotlib inline
%load_ext autoreload
%autoreload 2
# Time execution
%time code_line
%%time
# cell content
# Memory profiling
%memit code_line
%%memit
# cell content
# Load external Python files
%load filename.py
%run filename.py
# System commands
!pip install package_name
!ls
!pwd
# Variable information
%whos
%who_ls
# Debug mode
%debug
%pdb on
# basic_ml_environment.yml
name: basic-ml
channels:
- conda-forge
- defaults
dependencies:
- python=3.9
- numpy
- pandas
- matplotlib
- scikit-learn
- jupyter
- notebook
# advanced_ml_environment.yml
name: advanced-ml
channels:
- conda-forge
- defaults
dependencies:
- python=3.9
- numpy=1.24.0
- pandas=2.0.0
- matplotlib=3.7.0
- seaborn=0.12.0
- scikit-learn=1.3.0
- scipy=1.10.0
- statsmodels=0.14.0
- jupyter=1.0.0
- notebook=6.5.0
- jupyterlab=3.6.0
- ipykernel=6.22.0
- ipywidgets=8.0.0
- pip=23.0.0
- pip:
- shap>=0.41.0
- lime>=0.2.0
- plotly>=5.14.0
- xgboost>=1.7.0
- lightgbm>=3.3.0
- catboost>=1.2.0
- optuna>=3.1.0
- mlflow>=2.3.0
# deep_learning_environment.yml
name: deep-learning
channels:
- conda-forge
- pytorch
- defaults
dependencies:
- python=3.9
- numpy
- pandas
- matplotlib
- scikit-learn
- jupyter
- pytorch
- torchvision
- tensorflow
- keras
- pip:
- transformers
- datasets
- accelerate
This completes Appendix A with comprehensive Python environment setup instructions, troubleshooting guides, and templates for different types of machine learning projects.
# Appendix B: Mathematical Foundations
This appendix provides a comprehensive review of the mathematical concepts essential for understanding machine learning algorithms. The material is designed to serve as both a refresher for those familiar with these topics and a learning resource for newcomers.
A vector v in n-dimensional space is represented as: $$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$
Vector Addition: $$\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}$$
Scalar Multiplication: $$c\mathbf{v} = \begin{bmatrix} cv_1 \\ cv_2 \\ \vdots \\ cv_n \end{bmatrix}$$
Dot Product: $$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1v_1 + u_2v_2 + \cdots + u_nv_n$$
Vector Magnitude (Norm): $$||\mathbf{v}||_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} = \sqrt{\mathbf{v} \cdot \mathbf{v}}$$
import numpy as np
import matplotlib.pyplot as plt
# Vector operations in Python
def demonstrate_vector_operations():
"""Demonstrate basic vector operations"""
# Define vectors
u = np.array([3, 4])
v = np.array([1, 2])
print("Vector Operations Demo")
print(f"u = {u}")
print(f"v = {v}")
# Addition
addition = u + v
print(f"u + v = {addition}")
# Scalar multiplication
scalar_mult = 2 * u
print(f"2 * u = {scalar_mult}")
# Dot product
dot_product = np.dot(u, v)
print(f"u · v = {dot_product}")
# Magnitude
magnitude_u = np.linalg.norm(u)
magnitude_v = np.linalg.norm(v)
print(f"||u|| = {magnitude_u:.3f}")
print(f"||v|| = {magnitude_v:.3f}")
# Visualization
plt.figure(figsize=(10, 4))
# Plot vectors
plt.subplot(1, 2, 1)
plt.quiver(0, 0, u[0], u[1], angles='xy', scale_units='xy', scale=1, color='blue', label='u')
plt.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color='red', label='v')
plt.quiver(0, 0, addition[0], addition[1], angles='xy', scale_units='xy', scale=1, color='green', label='u+v')
plt.xlim(-1, 6)
plt.ylim(-1, 7)
plt.grid(True)
plt.legend()
plt.title('Vector Addition')
plt.axis('equal')
# Plot dot product geometric interpretation
plt.subplot(1, 2, 2)
theta = np.arccos(dot_product / (magnitude_u * magnitude_v))
plt.quiver(0, 0, u[0], u[1], angles='xy', scale_units='xy', scale=1, color='blue', label=f'u (||u||={magnitude_u:.2f})')
plt.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color='red', label=f'v (||v||={magnitude_v:.2f})')
# Project v onto u
proj_v_on_u = (dot_product / (magnitude_u**2)) * u
plt.quiver(0, 0, proj_v_on_u[0], proj_v_on_u[1], angles='xy', scale_units='xy', scale=1, color='orange', label='proj_u(v)')
plt.xlim(-1, 4)
plt.ylim(-1, 5)
plt.grid(True)
plt.legend()
plt.title(f'Dot Product: {dot_product:.2f}\nAngle: {np.degrees(theta):.1f}°')
plt.axis('equal')
plt.tight_layout()
plt.show()
return {
'vectors': {'u': u, 'v': v},
'operations': {
'addition': addition,
'dot_product': dot_product,
'angle': np.degrees(theta)
}
}
# Run demonstration
vector_demo = demonstrate_vector_operations()
An m×n matrix A is a rectangular array: $$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$
Matrix Addition: $(\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}$
Matrix Multiplication: $$(\mathbf{AB})_{ij} = \sum_{k=1}^{p} a_{ik}b_{kj}$$
Transpose: $(\mathbf{A}^T)_{ij} = a_{ji}$
Identity Matrix: $$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$
def demonstrate_matrix_operations():
"""Demonstrate essential matrix operations"""
print("Matrix Operations Demo")
# Define matrices
A = np.array([[1, 2, 3],
[4, 5, 6]])
B = np.array([[7, 8],
[9, 10],
[11, 12]])
C = np.array([[1, 1],
[2, 2]])
print(f"Matrix A (2x3):\n{A}")
print(f"Matrix B (3x2):\n{B}")
print(f"Matrix C (2x2):\n{C}")
# Matrix multiplication
AB = np.dot(A, B) # or A @ B
print(f"\nA × B (2x2):\n{AB}")
# Matrix addition (same dimensions)
AB_plus_C = AB + C
print(f"\n(A × B) + C:\n{AB_plus_C}")
# Transpose
A_T = A.T
print(f"\nA^T (3x2):\n{A_T}")
# Identity matrix
I = np.eye(2)
print(f"\nIdentity matrix (2x2):\n{I}")
# Verify identity property
AB_times_I = AB @ I
print(f"\n(A × B) × I = A × B:\n{AB_times_I}")
print(f"Equal to AB? {np.allclose(AB, AB_times_I)}")
return {
'matrices': {'A': A, 'B': B, 'C': C},
'products': {'AB': AB, 'AB_plus_C': AB_plus_C},
'transpose': A_T
}
matrix_demo = demonstrate_matrix_operations()
For a square matrix A, an eigenvector v and eigenvalue λ satisfy: $$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$
The eigenvalues are found as the roots of the characteristic equation: $$\det(\mathbf{A} - \lambda\mathbf{I}) = 0$$
def demonstrate_eigenvalues():
"""Demonstrate eigenvalue decomposition"""
print("Eigenvalue Decomposition Demo")
# Create a symmetric matrix
A = np.array([[4, -2],
[-2, 1]])
print(f"Matrix A:\n{A}")
# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}")
# Verify Av = λv for each eigenvalue/eigenvector pair
for i, (λ, v) in enumerate(zip(eigenvalues, eigenvectors.T)):
Av = A @ v
λv = λ * v
print(f"\nEigenvalue {i+1}: λ = {λ:.3f}")
print(f"Eigenvector: v = {v}")
print(f"Av = {Av}")
print(f"λv = {λv}")
print(f"Av "�� λv? {np.allclose(Av, λv)}")
# Visualization
plt.figure(figsize=(12, 4))
# Original vectors
plt.subplot(1, 3, 1)
x = np.linspace(-3, 3, 10)
y = np.linspace(-3, 3, 10)
X, Y = np.meshgrid(x, y)
# Apply transformation to grid
for i in range(len(x)):
for j in range(len(y)):
if i % 2 == 0 and j % 2 == 0: # Subsample for clarity
original = np.array([X[i,j], Y[i,j]])
if np.linalg.norm(original) > 0.1: # Avoid zero vector
transformed = A @ original
plt.arrow(0, 0, original[0], original[1],
head_width=0.1, head_length=0.1,
fc='blue', ec='blue', alpha=0.3)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.grid(True)
plt.title('Original Vectors')
plt.axis('equal')
# Transformed vectors
plt.subplot(1, 3, 2)
for i in range(len(x)):
for j in range(len(y)):
if i % 2 == 0 and j % 2 == 0:
original = np.array([X[i,j], Y[i,j]])
if np.linalg.norm(original) > 0.1:
transformed = A @ original
plt.arrow(0, 0, transformed[0], transformed[1],
head_width=0.1, head_length=0.1,
fc='red', ec='red', alpha=0.3)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.grid(True)
plt.title('Transformed Vectors')
plt.axis('equal')
# Eigenvectors
plt.subplot(1, 3, 3)
colors = ['green', 'purple']
for i, (λ, v) in enumerate(zip(eigenvalues, eigenvectors.T)):
# Plot eigenvector and its transformation
plt.arrow(0, 0, v[0], v[1],
head_width=0.1, head_length=0.1,
fc=colors[i], ec=colors[i],
label=f'v{i+1} (λ={λ:.2f})')
plt.arrow(0, 0, λ*v[0], λ*v[1],
head_width=0.1, head_length=0.1,
fc=colors[i], ec=colors[i], alpha=0.5,
linestyle='--')
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.grid(True)
plt.title('Eigenvectors and Eigenvalues')
plt.legend()
plt.axis('equal')
plt.tight_layout()
plt.show()
return {
'matrix': A,
'eigenvalues': eigenvalues,
'eigenvectors': eigenvectors
}
eigen_demo = demonstrate_eigenvalues()
PCA finds the eigenvectors of the covariance matrix to identify principal components.
Given data matrix X (n×p), the covariance matrix is: $$\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$$
The principal components are the eigenvectors of C corresponding to the largest eigenvalues.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
def demonstrate_pca_mathematics():
"""Demonstrate the mathematics behind PCA"""
print("PCA Mathematics Demo")
# Load iris dataset
data = load_iris()
X = data.data
y = data.target
print(f"Data shape: {X.shape}")
# Center the data
X_centered = X - np.mean(X, axis=0)
# Calculate covariance matrix
n = X_centered.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
print(f"Covariance matrix:\n{cov_matrix}")
# Manual PCA calculation
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Sort by eigenvalue magnitude
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print(f"\nEigenvalues: {eigenvalues}")
print(f"Explained variance ratio: {eigenvalues / np.sum(eigenvalues)}")
# Project data onto first two principal components
PC1 = eigenvectors[:, 0]
PC2 = eigenvectors[:, 1]
projected_data = X_centered @ np.column_stack([PC1, PC2])
# Compare with sklearn PCA
pca = PCA(n_components=2)
sklearn_projected = pca.fit_transform(X)
print(f"\nManual PCA first PC: {PC1}")
print(f"Sklearn PCA first PC: {pca.components_[0]}")
print(f"Components match? {np.allclose(np.abs(PC1), np.abs(pca.components_[0]))}")
# Visualization
plt.figure(figsize=(15, 5))
# Original data (first two features)
plt.subplot(1, 3, 1)
colors = ['red', 'green', 'blue']
for i in range(3):
mask = y == i
plt.scatter(X[mask, 0], X[mask, 1], c=colors[i], label=data.target_names[i])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Original Data (First 2 Features)')
plt.legend()
plt.grid(True)
# PCA projected data (manual)
plt.subplot(1, 3, 2)
for i in range(3):
mask = y == i
plt.scatter(projected_data[mask, 0], projected_data[mask, 1], c=colors[i])
plt.xlabel(f'PC1 ({eigenvalues[0]/np.sum(eigenvalues):.1%} var)')
plt.ylabel(f'PC2 ({eigenvalues[1]/np.sum(eigenvalues):.1%} var)')
plt.title('Manual PCA Projection')
plt.grid(True)
# PCA projected data (sklearn)
plt.subplot(1, 3, 3)
for i in range(3):
mask = y == i
plt.scatter(sklearn_projected[mask, 0], sklearn_projected[mask, 1], c=colors[i])
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)')
plt.title('Sklearn PCA Projection')
plt.grid(True)
plt.tight_layout()
plt.show()
return {
'covariance_matrix': cov_matrix,
'eigenvalues': eigenvalues,
'eigenvectors': eigenvectors,
'manual_projection': projected_data,
'sklearn_projection': sklearn_projected
}
pca_demo = demonstrate_pca_mathematics()
import scipy.stats as stats
import seaborn as sns  # used by the later Bayes' theorem demo in this appendix
def demonstrate_descriptive_statistics():
"""Demonstrate descriptive statistics"""
print("Descriptive Statistics Demo")
# Generate sample data
np.random.seed(42)
data_normal = np.random.normal(50, 15, 1000)
data_skewed = np.random.exponential(2, 1000)
datasets = {
'Normal Distribution': data_normal,
'Skewed Distribution': data_skewed
}
plt.figure(figsize=(15, 10))
for i, (name, data) in enumerate(datasets.items()):
# Calculate statistics
mean = np.mean(data)
median = np.median(data)
std = np.std(data, ddof=1)
variance = np.var(data, ddof=1)
skewness = stats.skew(data)
kurtosis = stats.kurtosis(data)
print(f"\n{name}:")
print(f" Mean: {mean:.2f}")
print(f" Median: {median:.2f}")
print(f" Std Dev: {std:.2f}")
print(f" Variance: {variance:.2f}")
print(f" Skewness: {skewness:.2f}")
print(f" Kurtosis: {kurtosis:.2f}")
# Histogram
plt.subplot(2, 2, 2*i + 1)
plt.hist(data, bins=30, alpha=0.7, density=True, color='skyblue', edgecolor='black')
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.axvline(median, color='green', linestyle='--', label=f'Median: {median:.2f}')
plt.title(f'{name}\nSkewness: {skewness:.2f}, Kurtosis: {kurtosis:.2f}')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.grid(True, alpha=0.3)
# Box plot
plt.subplot(2, 2, 2*i + 2)
plt.boxplot(data, vert=True)
plt.title(f'{name} - Box Plot')
plt.ylabel('Value')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
return {
'normal_stats': {
'mean': np.mean(data_normal),
'std': np.std(data_normal, ddof=1),
'skewness': stats.skew(data_normal)
},
'skewed_stats': {
'mean': np.mean(data_skewed),
'std': np.std(data_skewed, ddof=1),
'skewness': stats.skew(data_skewed)
}
}
stats_demo = demonstrate_descriptive_statistics()
Normal Distribution: $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Binomial Distribution: $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Poisson Distribution: $$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
def demonstrate_probability_distributions():
"""Demonstrate common probability distributions"""
print("Probability Distributions Demo")
plt.figure(figsize=(15, 10))
# Normal Distribution
plt.subplot(2, 3, 1)
x = np.linspace(-4, 4, 100)
for mu, sigma in [(0, 1), (0, 0.5), (1, 1)]:
y = stats.norm.pdf(x, mu, sigma)
plt.plot(x, y, label=f'μ={mu}, σ={sigma}')
plt.title('Normal Distribution')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
# Binomial Distribution
plt.subplot(2, 3, 2)
x = np.arange(0, 21)
for n, p in [(20, 0.3), (20, 0.5), (20, 0.7)]:
y = stats.binom.pmf(x, n, p)
plt.plot(x, y, 'o-', label=f'n={n}, p={p}')
plt.title('Binomial Distribution')
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.grid(True)
# Poisson Distribution
plt.subplot(2, 3, 3)
x = np.arange(0, 15)
for lam in [1, 3, 5]:
y = stats.poisson.pmf(x, lam)
plt.plot(x, y, 'o-', label=f'λ={lam}')
plt.title('Poisson Distribution')
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.grid(True)
# Exponential Distribution
plt.subplot(2, 3, 4)
x = np.linspace(0, 5, 100)
for lam in [0.5, 1, 2]:
y = stats.expon.pdf(x, scale=1/lam)
plt.plot(x, y, label=f'λ={lam}')
plt.title('Exponential Distribution')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
# Uniform Distribution
plt.subplot(2, 3, 5)
x = np.linspace(-1, 3, 100)
for a, b in [(0, 1), (0, 2), (-0.5, 1.5)]:
y = stats.uniform.pdf(x, a, b-a)
plt.plot(x, y, label=f'a={a}, b={b}')
plt.title('Uniform Distribution')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
# Beta Distribution
plt.subplot(2, 3, 6)
x = np.linspace(0, 1, 100)
for alpha, beta in [(0.5, 0.5), (2, 2), (5, 1)]:
y = stats.beta.pdf(x, alpha, beta)
plt.plot(x, y, label=f'α={alpha}, β={beta}')
plt.title('Beta Distribution')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
return "Probability distributions demonstrated"
prob_demo = demonstrate_probability_distributions()
Bayes' Theorem: $$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
Where:
- P(A|B) is the posterior probability
- P(B|A) is the likelihood
- P(A) is the prior probability
- P(B) is the marginal probability
def demonstrate_bayes_theorem():
"""Demonstrate Bayes' theorem with practical example"""
print("Bayes' Theorem Demo: Medical Diagnosis")
# Medical diagnosis example
# Disease prevalence (prior)
P_disease = 0.01 # 1% of population has the disease
P_no_disease = 1 - P_disease
# Test accuracy
P_positive_given_disease = 0.95 # Sensitivity (true positive rate)
P_negative_given_no_disease = 0.90 # Specificity (true negative rate)
P_positive_given_no_disease = 1 - P_negative_given_no_disease # False positive rate
# Calculate marginal probability of positive test
P_positive = (P_positive_given_disease * P_disease +
P_positive_given_no_disease * P_no_disease)
# Apply Bayes' theorem
P_disease_given_positive = ((P_positive_given_disease * P_disease) / P_positive)
print(f"Prior probability of disease: {P_disease:.1%}")
print(f"Test sensitivity (P(+|Disease)): {P_positive_given_disease:.1%}")
print(f"Test specificity (P(-|No Disease)): {P_negative_given_no_disease:.1%}")
print(f"Probability of positive test: {P_positive:.1%}")
print(f"Posterior probability (P(Disease|+)): {P_disease_given_positive:.1%}")
# Visualization
plt.figure(figsize=(12, 8))
# Create confusion matrix visualization
population = 100000
diseased = int(population * P_disease)
healthy = population - diseased
true_positives = int(diseased * P_positive_given_disease)
false_negatives = diseased - true_positives
true_negatives = int(healthy * P_negative_given_no_disease)
false_positives = healthy - true_negatives
# Confusion matrix
plt.subplot(2, 2, 1)
conf_matrix = np.array([[true_negatives, false_positives],
[false_negatives, true_positives]])
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=['Predicted Negative', 'Predicted Positive'],
yticklabels=['Actually Healthy', 'Actually Diseased'])
plt.title('Confusion Matrix\n(Population: 100,000)')
# Prior vs Posterior
plt.subplot(2, 2, 2)
categories = ['Prior\n(Population)', 'Posterior\n(Test Positive)']
probabilities = [P_disease, P_disease_given_positive]
bars = plt.bar(categories, probabilities, color=['lightblue', 'darkblue'])
plt.ylabel('Probability of Disease')
plt.title('Prior vs Posterior Probability')
plt.ylim(0, max(probabilities) * 1.2)
for bar, prob in zip(bars, probabilities):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
f'{prob:.1%}', ha='center', va='bottom')
# Effect of prevalence
plt.subplot(2, 2, 3)
prevalences = np.logspace(-4, -1, 50) # 0.01% to 10%
posteriors = []
for prev in prevalences:
P_pos = P_positive_given_disease * prev + P_positive_given_no_disease * (1 - prev)
posterior = (P_positive_given_disease * prev) / P_pos
posteriors.append(posterior)
plt.semilogx(prevalences * 100, np.array(posteriors) * 100)
plt.axvline(P_disease * 100, color='red', linestyle='--',
label=f'Current prevalence ({P_disease:.1%})')
plt.axhline(P_disease_given_positive * 100, color='red', linestyle='--',
label=f'Current posterior ({P_disease_given_positive:.1%})')
plt.xlabel('Disease Prevalence (%)')
plt.ylabel('Posterior Probability (%)')
plt.title('Effect of Disease Prevalence')
plt.grid(True)
plt.legend()
# Effect of test accuracy
plt.subplot(2, 2, 4)
sensitivities = np.linspace(0.5, 1.0, 50)
posteriors_sens = []
for sens in sensitivities:
P_pos = sens * P_disease + P_positive_given_no_disease * P_no_disease
posterior = (sens * P_disease) / P_pos
posteriors_sens.append(posterior)
plt.plot(sensitivities * 100, np.array(posteriors_sens) * 100)
plt.axvline(P_positive_given_disease * 100, color='red', linestyle='--',
label=f'Current sensitivity ({P_positive_given_disease:.1%})')
plt.axhline(P_disease_given_positive * 100, color='red', linestyle='--',
label=f'Current posterior ({P_disease_given_positive:.1%})')
plt.xlabel('Test Sensitivity (%)')
plt.ylabel('Posterior Probability (%)')
plt.title('Effect of Test Sensitivity')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
return {
'prior': P_disease,
'posterior': P_disease_given_positive,
'confusion_matrix': {
'true_positives': true_positives,
'false_positives': false_positives,
'true_negatives': true_negatives,
'false_negatives': false_negatives
}
}
bayes_demo = demonstrate_bayes_theorem()
The derivative of a function measures its instantaneous rate of change: $$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
For multivariable functions: $$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}$$
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
def demonstrate_gradients():
"""Demonstrate gradient computation and visualization"""
print("Gradient Computation Demo")
# Define a simple function: f(x, y) = x^2 + y^2 + 2xy
def f(x, y):
return x**2 + y**2 + 2*x*y
# Analytical gradients
def grad_f_analytical(x, y):
df_dx = 2*x + 2*y
df_dy = 2*y + 2*x
return np.array([df_dx, df_dy])
# Numerical gradients
def grad_f_numerical(x, y, h=1e-5):
df_dx = (f(x+h, y) - f(x-h, y)) / (2*h)
df_dy = (f(x, y+h) - f(x, y-h)) / (2*h)
return np.array([df_dx, df_dy])
# Test point
x0, y0 = 1.5, 1.0
grad_analytical = grad_f_analytical(x0, y0)
grad_numerical = grad_f_numerical(x0, y0)
print(f"At point ({x0}, {y0}):")
print(f"Analytical gradient: {grad_analytical}")
print(f"Numerical gradient: {grad_numerical}")
print(f"Difference: {np.linalg.norm(grad_analytical - grad_numerical):.2e}")
# Visualization
plt.figure(figsize=(15, 5))
# Function surface
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.subplot(1, 3, 1)
contour = plt.contour(X, Y, Z, levels=20)
plt.colorbar(contour)
plt.plot(x0, y0, 'ro', markersize=8, label=f'Point ({x0}, {y0})')
# Plot gradient vector
plt.arrow(x0, y0, -grad_analytical[0]*0.2, -grad_analytical[1]*0.2,
head_width=0.1, head_length=0.1, fc='red', ec='red',
label='Negative gradient')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Function Contours and Gradient')
plt.legend()
plt.axis('equal')
plt.grid(True)
# Gradient descent visualization
plt.subplot(1, 3, 2)
# Perform gradient descent
x_current, y_current = 2.5, 2.0
learning_rate = 0.1
trajectory_x = [x_current]
trajectory_y = [y_current]
values = [f(x_current, y_current)]
for i in range(20):
grad = grad_f_analytical(x_current, y_current)
x_current -= learning_rate * grad[0]
y_current -= learning_rate * grad[1]
trajectory_x.append(x_current)
trajectory_y.append(y_current)
values.append(f(x_current, y_current))
plt.contour(X, Y, Z, levels=20, alpha=0.6)
plt.plot(trajectory_x, trajectory_y, 'ro-', linewidth=2, markersize=4)
plt.plot(trajectory_x[0], trajectory_y[0], 'go', markersize=10, label='Start')
plt.plot(trajectory_x[-1], trajectory_y[-1], 'bo', markersize=10, label='End')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Descent Trajectory')
plt.legend()
plt.axis('equal')
plt.grid(True)
# Convergence plot
plt.subplot(1, 3, 3)
plt.plot(values, 'b-', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Function Value')
plt.title('Gradient Descent Convergence')
plt.grid(True)
plt.tight_layout()
plt.show()
return {
'analytical_gradient': grad_analytical,
'numerical_gradient': grad_numerical,
'trajectory': {'x': trajectory_x, 'y': trajectory_y, 'values': values}
}
grad_demo = demonstrate_gradients()
The chain rule is fundamental to backpropagation in neural networks:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x}$$
For a composition f(g(x)): $$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$
def demonstrate_chain_rule():
"""Demonstrate chain rule computation"""
print("Chain Rule and Backpropagation Demo")
# Simple neural network example: f(x) = sigmoid(w2 * relu(w1 * x + b1) + b2)
def relu(x):
return np.maximum(0, x)
def relu_derivative(x):
return (x > 0).astype(float)
def sigmoid(x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def sigmoid_derivative(x):
s = sigmoid(x)
return s * (1 - s)
# Network parameters
w1, b1 = 0.5, 0.1
w2, b2 = 0.8, -0.2
# Forward pass
def forward_pass(x):
# Layer 1: linear transformation
z1 = w1 * x + b1
# Layer 1: activation
a1 = relu(z1)
# Layer 2: linear transformation
z2 = w2 * a1 + b2
# Layer 2: activation (output)
a2 = sigmoid(z2)
return {
'x': x, 'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2
}
# Backward pass (manual chain rule)
def backward_pass(forward_values, target):
x = forward_values['x']
z1 = forward_values['z1']
a1 = forward_values['a1']
z2 = forward_values['z2']
a2 = forward_values['a2']
# Loss function: L = 0.5 * (a2 - target)^2
loss = 0.5 * (a2 - target)**2
# Backward pass using chain rule
# dL/da2
dL_da2 = a2 - target
# dL/dz2 = dL/da2 * da2/dz2
dL_dz2 = dL_da2 * sigmoid_derivative(z2)
# dL/dw2 = dL/dz2 * dz2/dw2 = dL/dz2 * a1
dL_dw2 = dL_dz2 * a1
# dL/db2 = dL/dz2 * dz2/db2 = dL/dz2 * 1
dL_db2 = dL_dz2
# dL/da1 = dL/dz2 * dz2/da1 = dL/dz2 * w2
dL_da1 = dL_dz2 * w2
# dL/dz1 = dL/da1 * da1/dz1
dL_dz1 = dL_da1 * relu_derivative(z1)
# dL/dw1 = dL/dz1 * dz1/dw1 = dL/dz1 * x
dL_dw1 = dL_dz1 * x
# dL/db1 = dL/dz1 * dz1/db1 = dL/dz1 * 1
dL_db1 = dL_dz1
return {
'loss': loss,
'dL_dw1': dL_dw1,
'dL_db1': dL_db1,
'dL_dw2': dL_dw2,
'dL_db2': dL_db2
}
# Numerical gradients for verification
def numerical_gradient(x, target, param_name, h=1e-5):
nonlocal w1, b1, w2, b2  # parameters are locals of the enclosing function, not module globals
# Store original value
if param_name == 'w1':
original = w1
w1 += h
loss_plus = 0.5 * (forward_pass(x)['a2'] - target)**2
w1 = original - h
loss_minus = 0.5 * (forward_pass(x)['a2'] - target)**2
w1 = original
elif param_name == 'b1':
original = b1
b1 += h
loss_plus = 0.5 * (forward_pass(x)['a2'] - target)**2
b1 = original - h
loss_minus = 0.5 * (forward_pass(x)['a2'] - target)**2
b1 = original
elif param_name == 'w2':
original = w2
w2 += h
loss_plus = 0.5 * (forward_pass(x)['a2'] - target)**2
w2 = original - h
loss_minus = 0.5 * (forward_pass(x)['a2'] - target)**2
w2 = original
elif param_name == 'b2':
original = b2
b2 += h
loss_plus = 0.5 * (forward_pass(x)['a2'] - target)**2
b2 = original - h
loss_minus = 0.5 * (forward_pass(x)['a2'] - target)**2
b2 = original
return (loss_plus - loss_minus) / (2 * h)
# Test with a sample input
x_test = 2.0
target = 0.8
# Forward and backward pass
forward_values = forward_pass(x_test)
backward_values = backward_pass(forward_values, target)
print(f"Input: {x_test}, Target: {target}")
print(f"Output: {forward_values['a2']:.4f}")
print(f"Loss: {backward_values['loss']:.4f}")
print("\nGradient Verification:")
gradients = ['dL_dw1', 'dL_db1', 'dL_dw2', 'dL_db2']
param_names = ['w1', 'b1', 'w2', 'b2']
for grad_name, param_name in zip(gradients, param_names):
analytical = backward_values[grad_name]
numerical = numerical_gradient(x_test, target, param_name)
print(f"{param_name}: Analytical = {analytical:.6f}, Numerical = {numerical:.6f}, "
f"Diff = {abs(analytical - numerical):.2e}")
# Visualization of function and gradients
plt.figure(figsize=(15, 5))
# Function output vs input
x_range = np.linspace(-2, 4, 100)
outputs = [forward_pass(x)['a2'] for x in x_range]
plt.subplot(1, 3, 1)
plt.plot(x_range, outputs, 'b-', linewidth=2, label='Network Output')
plt.plot(x_test, forward_values['a2'], 'ro', markersize=8, label=f'Test Point')
plt.axhline(target, color='g', linestyle='--', label='Target')
plt.xlabel('Input (x)')
plt.ylabel('Output')
plt.title('Neural Network Function')
plt.legend()
plt.grid(True)
# Gradient descent on parameters
plt.subplot(1, 3, 2)
# Simulate parameter updates
w1_history = [w1]
w2_history = [w2]
loss_history = [backward_values['loss']]
learning_rate = 0.1
current_w1, current_w2 = w1, w2
for i in range(50):
# Use current parameters
w1, w2 = current_w1, current_w2
forward_vals = forward_pass(x_test)
backward_vals = backward_pass(forward_vals, target)
# Update parameters
current_w1 -= learning_rate * backward_vals['dL_dw1']
current_w2 -= learning_rate * backward_vals['dL_dw2']
w1_history.append(current_w1)
w2_history.append(current_w2)
loss_history.append(backward_vals['loss'])
# Restore original parameters
w1, w2 = 0.5, 0.8
plt.plot(loss_history, 'b-', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Loss Decrease via Gradient Descent')
plt.grid(True)
# Parameter space
plt.subplot(1, 3, 3)
plt.plot(w1_history, w2_history, 'ro-', linewidth=2, markersize=4)
plt.plot(w1_history[0], w2_history[0], 'go', markersize=10, label='Start')
plt.plot(w1_history[-1], w2_history[-1], 'bo', markersize=10, label='End')
plt.xlabel('w1')
plt.ylabel('w2')
plt.title('Parameter Trajectory')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
return {
'forward_pass': forward_values,
'backward_pass': backward_values,
'parameter_history': {
'w1': w1_history,
'w2': w2_history,
'loss': loss_history
}
}
chain_rule_demo = demonstrate_chain_rule()
Given training data $(x_i, y_i)$ for $i = 1, \ldots, n$, we want to find parameters w and b that minimize:
$$L(w, b) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - (wx_i + b))^2$$
Analytical Solution: $$w = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
$$b = \bar{y} - w\bar{x}$$
Gradient Descent: $$\frac{\partial L}{\partial w} = -\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - (wx_i + b))$$
$$\frac{\partial L}{\partial b} = -\frac{1}{n}\sum_{i=1}^{n} (y_i - (wx_i + b))$$
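The two approaches can be checked against each other on synthetic data. The sketch below is a minimal illustration (the data-generating values, learning rate, and iteration count are arbitrary choices, not from the text): it computes the closed-form solution and verifies that gradient descent converges to the same parameters.
import numpy as np
# Synthetic data: y = 2x + 1 + noise (illustrative values)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y_data = 2 * x + 1 + rng.normal(0, 1, 100)
# Closed-form least-squares solution
w_closed = np.sum((x - x.mean()) * (y_data - y_data.mean())) / np.sum((x - x.mean())**2)
b_closed = y_data.mean() - w_closed * x.mean()
# Gradient descent on the same loss L(w, b)
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    residual = y_data - (w * x + b)
    w += lr * np.mean(x * residual)  # step along -dL/dw
    b += lr * np.mean(residual)      # step along -dL/db
print(f"Closed form:      w = {w_closed:.3f}, b = {b_closed:.3f}")
print(f"Gradient descent: w = {w:.3f}, b = {b:.3f}")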
Sigmoid Function: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Probability Model: $$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)$$
Log-Likelihood: $$\ell(w, b) = \sum_{i=1}^{n} [y_i \log(\sigma(w^Tx_i + b)) + (1-y_i) \log(1-\sigma(w^Tx_i + b))]$$
Gradients: $$\frac{\partial \ell}{\partial w} = \sum_{i=1}^{n} (y_i - \sigma(w^Tx_i + b))x_i$$
$$\frac{\partial \ell}{\partial b} = \sum_{i=1}^{n} (y_i - \sigma(w^Tx_i + b))$$
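As a quick sanity check of these gradient formulas, the following sketch performs gradient ascent on the log-likelihood (averaged over samples) for synthetic one-dimensional data; all constants here are illustrative.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Two overlapping classes on the real line (illustrative)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])
# Gradient ascent on the averaged log-likelihood
w, b, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    p = sigmoid(w * x + b)
    w += lr * np.mean((y - p) * x)  # d(log-likelihood)/dw
    b += lr * np.mean(y - p)        # d(log-likelihood)/db
accuracy = np.mean((sigmoid(w * x + b) > 0.5) == y)
print(f"w = {w:.3f}, b = {b:.3f}, training accuracy = {accuracy:.2%}")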
Primal Problem: $$\min_{w,b,\xi} \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i$$
Subject to: $$y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Dual Problem: $$\max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j$$
Subject to: $$0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n}\alpha_i y_i = 0$$
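Scikit-learn's SVC solves this dual problem internally. The sketch below (toy data; it relies on the documented fact that dual_coef_ stores the products α_i·y_i for the support vectors) fits a linear SVM and checks the dual constraints:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two-class toy data (illustrative)
X, y = make_blobs(n_samples=100, centers=2, random_state=42)
C = 1.0
svm = SVC(kernel='linear', C=C).fit(X, y)
alpha_y = svm.dual_coef_.ravel()  # alpha_i * y_i for each support vector
print(f"Support vectors: {len(svm.support_)}")
print(f"sum(alpha_i * y_i) = {alpha_y.sum():.6f} (should be ~0)")
print(f"0 <= alpha_i <= C holds? {np.all(np.abs(alpha_y) <= C + 1e-9)}")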
For a layer $l$ with input $\mathbf{a}^{(l-1)}$ and output $\mathbf{a}^{(l)}$:
Forward Pass: $$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \quad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})$$
Backward Pass: $$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$$
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)}(a^{(l-1)})^T$$
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$
$$\delta^{(l-1)} = (\mathbf{W}^{(l)})^T\delta^{(l)} \odot \sigma'(\mathbf{z}^{(l-1)})$$
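A minimal NumPy sketch of these layer equations for a single sigmoid layer (the loss L = ½‖a‖² and all dimensions are illustrative choices; the finite-difference comparison at the end is only a sanity check):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
a_prev = rng.normal(size=(4, 1))  # a^(l-1): four inputs
W = rng.normal(size=(3, 4))       # W^(l): layer with three units
b = rng.normal(size=(3, 1))       # b^(l)
# Forward pass: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = sigma(z^(l))
z = W @ a_prev + b
a = sigmoid(z)
# Illustrative loss L = 0.5 * ||a||^2, so dL/da = a
delta = a * (a * (1 - a))  # delta^(l) = dL/da * sigma'(z), with sigma'(z) = a(1 - a)
dW = delta @ a_prev.T      # dL/dW^(l) = delta^(l) (a^(l-1))^T
db = delta                 # dL/db^(l) = delta^(l)
back = W.T @ delta         # (W^(l))^T delta^(l), still to be multiplied by sigma'(z^(l-1))
# Finite-difference check on one weight
h = 1e-6
loss = lambda Wm: 0.5 * np.sum(sigmoid(Wm @ a_prev + b) ** 2)
W_pert = W.copy()
W_pert[0, 0] += h
print(f"analytic dW[0,0] = {dW[0, 0]:.6f}")
print(f"numerical        = {(loss(W_pert) - loss(W)) / h:.6f}")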
| Function | Derivative |
|---|---|
| $f(x) = c$ | $f'(x) = 0$ |
| $f(x) = x^n$ | $f'(x) = nx^{n-1}$ |
| $f(x) = e^x$ | $f'(x) = e^x$ |
| $f(x) = \ln(x)$ | $f'(x) = \frac{1}{x}$ |
| $f(x) = \sin(x)$ | $f'(x) = \cos(x)$ |
| $f(x) = \cos(x)$ | $f'(x) = -\sin(x)$ |
| $f(x) = \frac{1}{1+e^{-x}}$ (sigmoid) | $f'(x) = f(x)(1 - f(x))$ |
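Each row of the table can be verified numerically with a central difference, as in this short sketch (the test point is arbitrary):
import numpy as np

def num_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3  # arbitrary test point
sigmoid = lambda t: 1 / (1 + np.exp(-t))
checks = {
    'x^3 -> 3x^2':       (lambda t: t**3, 3 * x**2),
    'e^x -> e^x':        (np.exp, np.exp(x)),
    'ln(x) -> 1/x':      (np.log, 1 / x),
    'sin(x) -> cos(x)':  (np.sin, np.cos(x)),
    'sigmoid -> s(1-s)': (sigmoid, sigmoid(x) * (1 - sigmoid(x))),
}
for name, (f, expected) in checks.items():
    print(f"{name}: numerical = {num_deriv(f, x):.6f}, analytical = {expected:.6f}")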
| Distribution | Parameters | Mean | Variance |
|---|---|---|---|
| Normal | $\mu, \sigma^2$ | $\mu$ | $\sigma^2$ |
| Binomial | $n, p$ | $np$ | $np(1-p)$ |
| Poisson | $\lambda$ | $\lambda$ | $\lambda$ |
| Exponential | $\lambda$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
| Uniform | $a, b$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ |
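The mean and variance columns can be cross-checked against scipy.stats frozen distributions, as in this sketch (the parameter values are arbitrary):
import scipy.stats as stats

mu, sigma, n, p, lam, a, b = 1.0, 2.0, 10, 0.3, 4.0, 2.0, 5.0
dists = {
    'Normal':      (stats.norm(mu, sigma),     mu,          sigma**2),
    'Binomial':    (stats.binom(n, p),         n * p,       n * p * (1 - p)),
    'Poisson':     (stats.poisson(lam),        lam,         lam),
    'Exponential': (stats.expon(scale=1/lam),  1 / lam,     1 / lam**2),
    'Uniform':     (stats.uniform(a, b - a),   (a + b) / 2, (b - a)**2 / 12),
}
for name, (dist, mean_f, var_f) in dists.items():
    print(f"{name}: mean {dist.mean():.4f} vs {mean_f:.4f}, var {dist.var():.4f} vs {var_f:.4f}")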
| Formula | Property |
|---|---|
| $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$ | Transpose of product |
| $(\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}$ | Inverse of transpose |
| $\det(\mathbf{AB}) = \det(\mathbf{A})\det(\mathbf{B})$ | Determinant of product |
| $\operatorname{tr}(\mathbf{A} + \mathbf{B}) = \operatorname{tr}(\mathbf{A}) + \operatorname{tr}(\mathbf{B})$ | Trace linearity |
| $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ | Eigenvalue equation |
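These identities are easy to confirm numerically with random matrices, as in the following sketch:
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
S = A @ A.T  # symmetric matrix, so its eigenpairs are real
lam, vecs = np.linalg.eig(S)
print('(AB)^T == B^T A^T:', np.allclose((A @ B).T, B.T @ A.T))
print('(A^-1)^T == (A^T)^-1:', np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T)))
print('det(AB) == det(A)det(B):', np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
print('tr(A+B) == tr(A)+tr(B):', np.isclose(np.trace(A + B), np.trace(A) + np.trace(B)))
print('Av == lambda*v:', np.allclose(S @ vecs[:, 0], lam[0] * vecs[:, 0]))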
This completes Appendix B with comprehensive mathematical foundations essential for understanding machine learning algorithms, including practical demonstrations and quick reference materials.
# Appendix C: Datasets and Resources
This appendix provides a comprehensive collection of datasets, code templates, and resources to support your machine learning journey. From built-in scikit-learn datasets to major public repositories, you’ll find everything needed to practice and implement machine learning algorithms.
Scikit-learn provides several built-in datasets that are perfect for learning and experimentation. These datasets are small, well-curated, and come preloaded with the library.
from sklearn.datasets import load_iris
import pandas as pd
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Convert to DataFrame for easier handling
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['target'] = y
iris_df['target_name'] = iris_df['target'].map({
0: 'setosa', 1: 'versicolor', 2: 'virginica'
})
print(f"Dataset shape: {iris_df.shape}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")Key Features: - Size: 150 samples, 4 features - Classes: 3 (setosa, versicolor, virginica) - Use Case: Multi-class classification, clustering - Features: Sepal length, sepal width, petal length, petal width
from sklearn.datasets import load_wine
wine = load_wine()
X, y = wine.data, wine.target
print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(wine.target_names)}")
print(f"Class distribution: {pd.Series(y).value_counts().sort_index()}")Key Features: - Size: 178 samples, 13 features - Classes: 3 wine cultivars - Use Case: Multi-class classification, feature selection - Features: Chemical analysis results
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {pd.Series(y).value_counts()}")Key Features: - Size: 569 samples, 30 features - Classes: 2 (malignant, benign) - Use Case: Binary classification, medical diagnosis - Features: Cell nuclei characteristics
import warnings
warnings.filterwarnings('ignore')  # suppress the load_boston deprecation warning
# Note: load_boston was removed in scikit-learn 1.2; this import only works on older versions
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print(f"Dataset shape: {X.shape}")
print(f"Target range: {y.min():.2f} to {y.max():.2f}")
print(f"Feature names: {boston.feature_names}")Key Features: - Size: 506 samples, 13 features - Target: Housing prices in $1000s - Use Case: Regression, feature importance analysis - Note: Deprecated due to ethical concerns, use California housing instead
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
X, y = california.data, california.target
print(f"Dataset shape: {X.shape}")
print(f"Target range: {y.min():.2f} to {y.max():.2f}")
print(f"Feature names: {california.feature_names}")Key Features: - Size: 20,640 samples, 8 features - Target: Housing prices in $100,000s - Use Case: Regression, large dataset handling
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(f"Dataset shape: {X.shape}")
print(f"Target range: {y.min():.2f} to {y.max():.2f}")Key Features: - Size: 442 samples, 10 features - Target: Diabetes progression measure - Use Case: Regression, medical prediction
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
# Visualize first few digits
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in enumerate(axes.flat):
ax.imshow(digits.images[i], cmap='gray')
ax.set_title(f'Digit: {digits.target[i]}')
ax.axis('off')
plt.tight_layout()
plt.show()
Key Features:
- Size: 1,797 samples, 64 features (8x8 pixels)
- Classes: 10 digits (0-9)
- Use Case: Image classification, clustering, dimensionality reduction
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()
X, y = faces.data, faces.target
print(f"Dataset shape: {X.shape}")
print(f"Number of people: {len(set(y))}")Key Features: - Size: 400 samples, 4,096 features (64x64 pixels) - Classes: 40 different people - Use Case: Face recognition, PCA, clustering
from sklearn.datasets import (
make_classification, make_regression, make_blobs,
make_circles, make_moons
)
# Classification dataset
X_class, y_class = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, n_classes=3, random_state=42
)
# Regression dataset
X_reg, y_reg = make_regression(
n_samples=1000, n_features=10, noise=0.1, random_state=42
)
# Clustering dataset
X_blobs, y_blobs = make_blobs(
n_samples=300, centers=4, cluster_std=1.0, random_state=42
)
# Non-linear datasets
X_circles, y_circles = make_circles(
n_samples=1000, noise=0.05, factor=0.6, random_state=42
)
X_moons, y_moons = make_moons(
n_samples=1000, noise=0.1, random_state=42
)
Kaggle is one of the most popular platforms for machine learning datasets and competitions.
# Install Kaggle API
# pip install kaggle
# Setup authentication (place kaggle.json in ~/.kaggle/)
import kaggle
# Download competition data (Titanic is a Kaggle competition, not a standalone dataset)
kaggle.api.competition_download_files('titanic', path='./datasets/raw/')
# Competition downloads arrive as a zip archive; extract it before loading
import zipfile
with zipfile.ZipFile('./datasets/raw/titanic.zip') as zf:
    zf.extractall('./datasets/raw/')
# Load downloaded data
import pandas as pd
titanic = pd.read_csv('./datasets/raw/train.csv')
The UCI ML Repository is a collection of databases, domain theories, and datasets widely used by the machine learning community.
import pandas as pd
# Adult Income Dataset (Census Income)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
'age', 'workclass', 'fnlwgt', 'education', 'education-num',
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'
]
adult_data = pd.read_csv(url, names=column_names, na_values=' ?')
# Car Evaluation Dataset
car_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
car_columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car_data = pd.read_csv(car_url, names=car_columns)
# Heart Disease Dataset
heart_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart_columns = [
'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'
]
heart_data = pd.read_csv(heart_url, names=heart_columns, na_values='?')
Seaborn provides easy access to several interesting datasets:
import seaborn as sns
# Load various datasets
tips = sns.load_dataset('tips')
flights = sns.load_dataset('flights')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
mpg = sns.load_dataset('mpg')
diamonds = sns.load_dataset('diamonds')
# List all available datasets
print(sns.get_dataset_names())
# CIFAR-10/100 (via TensorFlow/Keras)
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# MNIST (via TensorFlow/Keras)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Fashion-MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()# 20 Newsgroups Dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Movie Reviews (via NLTK)
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
class DataPreprocessor:
"""
Comprehensive data preprocessing pipeline
"""
def __init__(self):
self.numeric_imputer = SimpleImputer(strategy='median')
self.categorical_imputer = SimpleImputer(strategy='most_frequent')
self.scaler = StandardScaler()
self.label_encoders = {}
def fit(self, X, y=None):
"""Fit preprocessing transformations"""
# Separate numeric and categorical columns
self.numeric_columns = X.select_dtypes(include=[np.number]).columns
self.categorical_columns = X.select_dtypes(include=['object']).columns
# Fit imputers
if len(self.numeric_columns) > 0:
self.numeric_imputer.fit(X[self.numeric_columns])
# Fit the scaler on imputed training data so transform() can reuse it
self.scaler.fit(self.numeric_imputer.transform(X[self.numeric_columns]))
if len(self.categorical_columns) > 0:
self.categorical_imputer.fit(X[self.categorical_columns])
# Fit label encoders for categorical variables
for col in self.categorical_columns:
le = LabelEncoder()
# Handle missing values for label encoder
non_null_values = X[col].dropna()
le.fit(non_null_values)
self.label_encoders[col] = le
return self
def transform(self, X):
"""Apply preprocessing transformations"""
X_processed = X.copy()
# Handle numeric columns
if len(self.numeric_columns) > 0:
X_processed[self.numeric_columns] = self.numeric_imputer.transform(
X_processed[self.numeric_columns]
)
# Reuse the scaler fitted in fit(); refitting here would leak test-set statistics
X_processed[self.numeric_columns] = self.scaler.transform(
X_processed[self.numeric_columns]
)
# Handle categorical columns
if len(self.categorical_columns) > 0:
X_processed[self.categorical_columns] = self.categorical_imputer.transform(
X_processed[self.categorical_columns]
)
for col in self.categorical_columns:
le = self.label_encoders[col]
# Handle unseen categories
X_processed[col] = X_processed[col].map(
lambda x: le.transform([x])[0] if x in le.classes_ else -1
)
return X_processed
def fit_transform(self, X, y=None):
"""Fit and transform in one step"""
return self.fit(X, y).transform(X)
# Usage example
def preprocess_dataset(df, target_column, test_size=0.2, random_state=42):
"""
Complete preprocessing pipeline for a dataset
"""
# Separate features and target
X = df.drop(columns=[target_column])
y = df[target_column]
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, random_state=random_state,
stratify=y if y.dtype == 'object' else None  # stratify only for categorical targets
)
# Preprocess features
preprocessor = DataPreprocessor()
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# Encode target if categorical
if y.dtype == 'object':
target_encoder = LabelEncoder()
y_train_encoded = target_encoder.fit_transform(y_train)
y_test_encoded = target_encoder.transform(y_test)
else:
y_train_encoded = y_train
y_test_encoded = y_test
target_encoder = None
return {
'X_train': X_train_processed,
'X_test': X_test_processed,
'y_train': y_train_encoded,
'y_test': y_test_encoded,
'preprocessor': preprocessor,
'target_encoder': target_encoder
}
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import LabelEncoder  # used by the categorical encoding steps below
class MissingValueHandler:
"""
Comprehensive missing value handling strategies
"""
@staticmethod
def analyze_missing_values(df):
"""Analyze missing value patterns"""
missing_stats = pd.DataFrame({
'Column': df.columns,
'Missing_Count': df.isnull().sum(),
'Missing_Percentage': (df.isnull().sum() / len(df)) * 100,
'Data_Type': df.dtypes
})
missing_stats = missing_stats[missing_stats['Missing_Count'] > 0]
missing_stats = missing_stats.sort_values('Missing_Percentage', ascending=False)
return missing_stats
@staticmethod
def simple_imputation(df, strategy_numeric='median', strategy_categorical='most_frequent'):
"""Simple imputation strategies"""
df_imputed = df.copy()
# Numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
imputer_numeric = SimpleImputer(strategy=strategy_numeric)
df_imputed[numeric_cols] = imputer_numeric.fit_transform(df[numeric_cols])
# Categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
if len(categorical_cols) > 0:
imputer_categorical = SimpleImputer(strategy=strategy_categorical)
df_imputed[categorical_cols] = imputer_categorical.fit_transform(df[categorical_cols])
return df_imputed
@staticmethod
def knn_imputation(df, n_neighbors=5):
"""K-Nearest Neighbors imputation"""
# Encode categorical variables first
df_encoded = df.copy()
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
le = LabelEncoder()
# Caveat: astype(str) turns NaN into the literal string 'nan', which the
# encoder treats as a regular category rather than a missing value
df_encoded[col] = le.fit_transform(df[col].astype(str))
label_encoders[col] = le
# Apply KNN imputation
imputer = KNNImputer(n_neighbors=n_neighbors)
df_imputed = pd.DataFrame(
imputer.fit_transform(df_encoded),
columns=df_encoded.columns,
index=df_encoded.index
)
# Decode categorical variables
for col, le in label_encoders.items():
df_imputed[col] = le.inverse_transform(df_imputed[col].astype(int))
return df_imputed
@staticmethod
def iterative_imputation(df, random_state=42):
"""Iterative (MICE) imputation"""
# Similar encoding process as KNN
df_encoded = df.copy()
label_encoders = {}
for col in df.select_dtypes(include=['object']).columns:
le = LabelEncoder()
df_encoded[col] = le.fit_transform(df[col].astype(str))
label_encoders[col] = le
# Apply iterative imputation
imputer = IterativeImputer(random_state=random_state)
df_imputed = pd.DataFrame(
imputer.fit_transform(df_encoded),
columns=df_encoded.columns,
index=df_encoded.index
)
# Decode categorical variables
for col, le in label_encoders.items():
df_imputed[col] = le.inverse_transform(df_imputed[col].astype(int))
return df_imputed
import pandas as pd
import numpy as np
from sklearn.preprocessing import (
StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures
)
from sklearn.feature_selection import (
SelectKBest, f_classif, f_regression, chi2, mutual_info_classif
)
class FeatureEngineer:
"""
Feature engineering and selection utilities
"""
@staticmethod
def create_polynomial_features(X, degree=2, include_bias=False):
"""Create polynomial features"""
poly = PolynomialFeatures(degree=degree, include_bias=include_bias)
X_poly = poly.fit_transform(X)
feature_names = poly.get_feature_names_out()
return pd.DataFrame(X_poly, columns=feature_names, index=X.index)
@staticmethod
def create_interaction_features(df, columns):
"""Create interaction features between specified columns"""
df_interactions = df.copy()
for i in range(len(columns)):
for j in range(i + 1, len(columns)):
col1, col2 = columns[i], columns[j]
interaction_name = f"{col1}_{col2}_interaction"
df_interactions[interaction_name] = df[col1] * df[col2]
return df_interactions
@staticmethod
def create_binning_features(df, column, bins=5, strategy='equal_width'):
"""Create binned categorical features from continuous variables"""
if strategy == 'equal_width':
df[f"{column}_binned"] = pd.cut(df[column], bins=bins)
elif strategy == 'equal_frequency':
df[f"{column}_binned"] = pd.qcut(df[column], q=bins)
return df
@staticmethod
def select_features_univariate(X, y, k=10, score_func=f_classif):
"""Univariate feature selection"""
selector = SelectKBest(score_func=score_func, k=k)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
scores = selector.scores_
feature_scores = pd.DataFrame({
'Feature': X.columns,
'Score': scores,
'Selected': selector.get_support()
}).sort_values('Score', ascending=False)
return X_selected, selected_features, feature_scores
@staticmethod
def scale_features(X, method='standard'):
"""Scale features using different methods"""
scalers = {
'standard': StandardScaler(),
'minmax': MinMaxScaler(),
'robust': RobustScaler()
}
if method not in scalers:
raise ValueError(f"Method must be one of: {list(scalers.keys())}")
scaler = scalers[method]
X_scaled = scaler.fit_transform(X)
return pd.DataFrame(X_scaled, columns=X.columns, index=X.index), scaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def load_and_explore_dataset(file_path, target_column=None):
"""
Load and perform initial exploration of a dataset
"""
# Load data
if file_path.endswith('.csv'):
df = pd.read_csv(file_path)
elif file_path.endswith('.json'):
df = pd.read_json(file_path)
elif file_path.endswith(('.xls', '.xlsx')):
df = pd.read_excel(file_path)
else:
raise ValueError("Unsupported file format")
print("="*50)
print("DATASET OVERVIEW")
print("="*50)
# Basic info
print(f"Dataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Data types
print("\nData Types:")
print(df.dtypes.value_counts())
# Missing values
missing_values = df.isnull().sum()
if missing_values.sum() > 0:
print("\nMissing Values:")
print(missing_values[missing_values > 0])
else:
print("\nNo missing values found!")
# Basic statistics
print("\nNumerical Statistics:")
print(df.describe())
# Categorical overview
categorical_cols = df.select_dtypes(include=['object']).columns
if len(categorical_cols) > 0:
print("\nCategorical Variables:")
for col in categorical_cols:
unique_count = df[col].nunique()
print(f"{col}: {unique_count} unique values")
if unique_count <= 10:
print(f" Values: {df[col].unique()}")
# Target variable analysis
if target_column and target_column in df.columns:
print(f"\nTarget Variable Analysis ({target_column}):")
if df[target_column].dtype in ['object', 'category']:
print(df[target_column].value_counts())
else:
print(df[target_column].describe())
return df
def create_correlation_heatmap(df, figsize=(12, 8)):
"""Create correlation heatmap for numerical variables"""
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) < 2:
print("Not enough numerical columns for correlation analysis")
return
plt.figure(figsize=figsize)
correlation_matrix = df[numeric_cols].corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True,
cmap='coolwarm', center=0, square=True,
linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
return correlation_matrix
def plot_target_distribution(df, target_column, plot_type='auto'):
"""Plot target variable distribution"""
if target_column not in df.columns:
print(f"Column '{target_column}' not found in dataset")
return
plt.figure(figsize=(10, 6))
if df[target_column].dtype in ['object', 'category'] or plot_type == 'categorical':
# Categorical target
value_counts = df[target_column].value_counts()
plt.subplot(1, 2, 1)
value_counts.plot(kind='bar')
plt.title(f'{target_column} Distribution')
plt.xticks(rotation=45)
plt.subplot(1, 2, 2)
plt.pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%')
plt.title(f'{target_column} Proportion')
else:
# Numerical target
plt.subplot(1, 2, 1)
plt.hist(df[target_column].dropna(), bins=30, edgecolor='black', alpha=0.7)
plt.title(f'{target_column} Distribution')
plt.xlabel(target_column)
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.boxplot(df[target_column].dropna())
plt.title(f'{target_column} Box Plot')
plt.ylabel(target_column)
plt.tight_layout()
plt.show()
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
def train_classification_model(model, X_train, X_test, y_train, y_test,
model_name="Model"):
"""
Train and evaluate a classification model
"""
print(f"Training {model_name}...")
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Calculate scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"\n{model_name} Results:")
print(f"Training Accuracy: {train_score:.4f}")
print(f"Testing Accuracy: {test_score:.4f}")
# Classification report
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_test))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'{model_name} Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"\nCross-validation Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
return model
def train_regression_model(model, X_train, X_test, y_train, y_test,
model_name="Model"):
"""
Train and evaluate a regression model
"""
print(f"Training {model_name}...")
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
test_mae = mean_absolute_error(y_test, y_pred_test)
print(f"\n{model_name} Results:")
print(f"Training RMSE: {train_rmse:.4f}")
print(f"Testing RMSE: {test_rmse:.4f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Testing R²: {test_r2:.4f}")
print(f"Testing MAE: {test_mae:.4f}")
# Residual plot
residuals = y_test - y_pred_test
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_pred_test, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.subplot(1, 3, 2)
plt.scatter(y_test, y_pred_test, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted')
plt.subplot(1, 3, 3)
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.tight_layout()
plt.show()
return model
def hyperparameter_tuning(model, param_grid, X_train, y_train,
scoring='accuracy', cv=5):
"""
Perform hyperparameter tuning using GridSearchCV
"""
print("Performing hyperparameter tuning...")
grid_search = GridSearchCV(
model, param_grid, scoring=scoring, cv=cv,
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
    return grid_search.best_estimator_

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
def create_feature_distribution_plots(df, columns=None, figsize=(15, 10)):
"""Create distribution plots for numerical features"""
if columns is None:
columns = df.select_dtypes(include=[np.number]).columns
n_cols = 3
n_rows = (len(columns) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    axes = np.atleast_1d(axes).ravel()  # handles the single-row and single-Axes cases safely
for i, column in enumerate(columns):
if i < len(axes):
axes[i].hist(df[column].dropna(), bins=30, alpha=0.7, edgecolor='black')
axes[i].set_title(f'{column} Distribution')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
# Remove empty subplots
for i in range(len(columns), len(axes)):
fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
def create_categorical_plots(df, target_column, categorical_columns=None):
"""Create plots for categorical variables vs target"""
if categorical_columns is None:
categorical_columns = df.select_dtypes(include=['object']).columns
categorical_columns = [col for col in categorical_columns if col != target_column]
n_cols = 2
n_rows = (len(categorical_columns) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
    axes = np.atleast_1d(axes).ravel()  # handles the single-row and single-Axes cases safely
for i, column in enumerate(categorical_columns):
if i < len(axes):
if df[target_column].dtype in ['object', 'category']:
# Categorical target
crosstab = pd.crosstab(df[column], df[target_column])
crosstab.plot(kind='bar', ax=axes[i], stacked=False)
axes[i].set_title(f'{column} vs {target_column}')
axes[i].set_xlabel(column)
axes[i].legend(title=target_column)
else:
# Numerical target
sns.boxplot(data=df, x=column, y=target_column, ax=axes[i])
axes[i].set_title(f'{column} vs {target_column}')
axes[i].tick_params(axis='x', rotation=45)
# Remove empty subplots
for i in range(len(categorical_columns), len(axes)):
fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
def create_interactive_scatter_plot(df, x_column, y_column, color_column=None):
"""Create interactive scatter plot using Plotly"""
if color_column:
fig = px.scatter(df, x=x_column, y=y_column, color=color_column,
title=f'{y_column} vs {x_column}',
hover_data=df.columns)
else:
fig = px.scatter(df, x=x_column, y=y_column,
title=f'{y_column} vs {x_column}',
hover_data=df.columns)
fig.show()
def plot_model_comparison(models_results, metric_name='Accuracy'):
"""Plot comparison of multiple model performances"""
model_names = list(models_results.keys())
scores = list(models_results.values())
plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, scores, alpha=0.8, color='skyblue', edgecolor='navy')
# Add value labels on bars
for bar, score in zip(bars, scores):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
f'{score:.3f}', ha='center', va='bottom')
plt.title(f'Model Performance Comparison ({metric_name})')
plt.ylabel(metric_name)
plt.xlabel('Models')
plt.xticks(rotation=45)
plt.ylim(0, max(scores) * 1.1)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
    plt.show()

# CLASSIFICATION ALGORITHMS
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
# REGRESSION ALGORITHMS
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
# CLUSTERING ALGORITHMS
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
# DIMENSIONALITY REDUCTION
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# MODEL SELECTION AND EVALUATION
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# PREPROCESSING
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import SelectKBest, RFE

| Issue | Symptoms | Solutions |
|---|---|---|
| Missing Values | NaN, NULL values | SimpleImputer, KNNImputer, IterativeImputer |
| Outliers | Extreme values | IQR method, Z-score, Isolation Forest |
| Categorical Data | Text/object columns | LabelEncoder, OneHotEncoder, Target Encoding |
| Feature Scale | Different value ranges | StandardScaler, MinMaxScaler, RobustScaler |
| High Dimensionality | Too many features | PCA, Feature Selection, Regularization |
| Imbalanced Classes | Unequal class distribution | SMOTE, Undersampling, Class weights |
| Multicollinearity | Highly correlated features | VIF analysis, PCA, Ridge regression |
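
Several of the fixes above can be combined into one leakage-safe workflow. The following is a minimal sketch, assuming a small hypothetical DataFrame with `age`, `income`, and `city` columns; the column lists and imputation strategies are illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical toy dataset -- replace with your own DataFrame and column lists
df = pd.DataFrame({
    'age': [25, 32, None, 47, 51, 38],
    'income': [40000, 52000, 61000, None, 88000, 47000],
    'city': ['Pune', 'Mumbai', 'Pune', np.nan, 'Nagpur', 'Mumbai'],
    'target': [0, 1, 0, 1, 1, 0],
})
numeric_cols = ['age', 'income']
categorical_cols = ['city']

# Impute + scale numeric columns; impute + one-hot encode categorical columns
preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_cols),
])

clf = Pipeline([('prep', preprocessor),
                ('model', LogisticRegression(max_iter=1000))])
clf.fit(df[numeric_cols + categorical_cols], df['target'])
print(f"Training accuracy: {clf.score(df[numeric_cols + categorical_cols], df['target']):.3f}")
```

Because the imputers and scalers live inside the pipeline, they are fit only on training folds during cross-validation, which avoids data leakage.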
| Problem Type | Recommended Algorithms | Key Considerations |
|---|---|---|
| Binary Classification | Logistic Regression, SVM, Random Forest | Interpretability vs Performance |
| Multi-class Classification | Random Forest, Gradient Boosting, Neural Networks | Handle class imbalance |
| Regression | Linear Regression, Random Forest, XGBoost | Linear vs Non-linear relationships |
| Clustering | K-Means, DBSCAN, Hierarchical | Number of clusters, cluster shapes |
| Dimensionality Reduction | PCA, t-SNE, UMAP | Preserve variance vs visualization |
| Time Series | ARIMA, LSTM, Prophet | Seasonality, trend, stationarity |
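
To turn the table above into a starting point for experiments, a small helper can map each problem type to a few baseline estimators. This is an illustrative sketch: the catalog mirrors the table, is deliberately not exhaustive, and the function name `candidate_models` is our own.

```python
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              RandomForestRegressor)
from sklearn.svm import SVC
from sklearn.cluster import KMeans, DBSCAN

def candidate_models(problem_type):
    """Return a dict of reasonable baseline models for a given problem type."""
    catalog = {
        'binary_classification': {
            'logistic_regression': LogisticRegression(max_iter=1000),
            'svm': SVC(probability=True),
            'random_forest': RandomForestClassifier(random_state=42),
        },
        'multiclass_classification': {
            'random_forest': RandomForestClassifier(random_state=42),
            'gradient_boosting': GradientBoostingClassifier(random_state=42),
        },
        'regression': {
            'linear_regression': LinearRegression(),
            'random_forest': RandomForestRegressor(random_state=42),
        },
        'clustering': {
            'kmeans': KMeans(n_clusters=3, random_state=42),
            'dbscan': DBSCAN(),
        },
    }
    return catalog[problem_type]

# Example: list the baseline classifiers for a binary problem
for name, model in candidate_models('binary_classification').items():
    print(name, '->', type(model).__name__)
```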
# Memory optimization techniques
import pandas as pd
import numpy as np
def optimize_memory_usage(df):
"""Optimize pandas DataFrame memory usage"""
start_memory = df.memory_usage(deep=True).sum() / 1024**2
for col in df.columns:
col_type = df[col].dtype
if col_type != object:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
            elif str(col_type)[:5] == 'float':
                # Downcast to float32 when the value range allows it
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype('category')
end_memory = df.memory_usage(deep=True).sum() / 1024**2
print(f'Memory usage decreased from {start_memory:.2f} MB to {end_memory:.2f} MB')
print(f'({100 * (start_memory - end_memory) / start_memory:.1f}% reduction)')
    return df

# Project structure template
"""
project/
"�✅✅" data/
"�� "�✅✅" raw/ # Original datasets
"�� "�✅✅" processed/ # Cleaned datasets
"�� "��"�✅" external/ # External data sources
"�✅✅" notebooks/ # Jupyter notebooks
"�✅✅" src/ # Source code
"�� "�✅✅" data/ # Data processing modules
"�� "�✅✅" features/ # Feature engineering
"�� "�✅✅" models/ # Model definitions
"�� "��"�✅" utils/ # Utility functions
"�✅✅" tests/ # Unit tests
"�✅✅" requirements.txt # Dependencies
"��"�✅" README.md # Project documentation
"""
# Configuration management
import os
from dataclasses import dataclass
@dataclass
class Config:
"""Project configuration"""
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(PROJECT_ROOT, 'data')
RAW_DATA_DIR = os.path.join(DATA_DIR, 'raw')
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, 'processed')
MODELS_DIR = os.path.join(PROJECT_ROOT, 'models')
# Model parameters
RANDOM_STATE = 42
TEST_SIZE = 0.2
CV_FOLDS = 5
# Preprocessing parameters
NUMERIC_STRATEGY = 'median'
    CATEGORICAL_STRATEGY = 'most_frequent'

This comprehensive appendix provides you with all the essential datasets, templates, and code snippets needed for your machine learning projects. Use these resources as starting points and adapt them to your specific needs and datasets.
Remember: The key to successful machine learning is not just having the right tools, but understanding when and how to use them effectively. Always start with thorough data exploration and choose methods that align with your problem domain and data characteristics.

# Appendix D: Evaluation Metrics Reference
This appendix provides a comprehensive reference for all evaluation metrics used in machine learning. Understanding when and how to use each metric is crucial for properly assessing model performance and making informed decisions about model selection and optimization.
Classification metrics evaluate how well a model predicts categorical outcomes. The choice of metric depends on the problem type, class distribution, and business objectives.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def plot_confusion_matrix(y_true, y_pred, classes=None, title='Confusion Matrix'):
"""
Plot confusion matrix with proper formatting
"""
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=classes, yticklabels=classes)
plt.title(title)
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
return cm
# For binary classification confusion matrix:
# [[TN, FP],
#  [FN, TP]]

Accuracy = (TP + TN) / (TP + TN + FP + FN)

from sklearn.metrics import accuracy_score
def calculate_accuracy(y_true, y_pred):
return accuracy_score(y_true, y_pred)
# When to use:
# ✅ Balanced datasets
# ✅ Equal misclassification costs
# When to avoid:
# ❌ Imbalanced datasets
# ❌ When specific class performance matters

Precision = TP / (TP + FP)

from sklearn.metrics import precision_score
def calculate_precision(y_true, y_pred, average='binary'):
return precision_score(y_true, y_pred, average=average)
# When to use:
# ✅ When false positives are costly
# ✅ Spam detection (don't want to mark good emails as spam)
# ✅ Medical diagnosis (don't want false alarms)

Recall (Sensitivity) = TP / (TP + FN)

from sklearn.metrics import recall_score
def calculate_recall(y_true, y_pred, average='binary'):
return recall_score(y_true, y_pred, average=average)
# When to use:
# ✅ When false negatives are costly
# ✅ Disease detection (don't want to miss cases)
# ✅ Fraud detection (don't want to miss fraudulent transactions)

Specificity = TN / (TN + FP)

def calculate_specificity(y_true, y_pred):
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
return tn / (tn + fp)
# When to use:
# ✅ When correctly identifying negatives is important
# ✅ Screening tests (identifying healthy individuals)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

from sklearn.metrics import f1_score
def calculate_f1(y_true, y_pred, average='binary'):
return f1_score(y_true, y_pred, average=average)
# When to use:
# ✅ Imbalanced datasets
# ✅ When you need balance between precision and recall
# ✅ Single metric for model comparison

F-beta = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)

from sklearn.metrics import fbeta_score
def calculate_fbeta(y_true, y_pred, beta=1.0):
return fbeta_score(y_true, y_pred, beta=beta)
# Beta values:
# β < 1: Emphasizes precision
# β > 1: Emphasizes recall
# β = 1: Equal weight (F1-score)
# Example usage:
f05_score = calculate_fbeta(y_true, y_pred, beta=0.5) # Favor precision
f2_score = calculate_fbeta(y_true, y_pred, beta=2.0)    # Favor recall

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
def plot_roc_curve(y_true, y_pred_proba):
fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
auc_score = roc_auc_score(y_true, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True)
plt.show()
return auc_score
# When to use:
# ✅ Binary classification
# ✅ When you want a threshold-independent metric
# ✅ Balanced datasets
# When to be careful:
# ⚠️ Highly imbalanced datasets (the precision-recall curve is more informative)

from sklearn.metrics import precision_recall_curve, average_precision_score
def plot_precision_recall_curve(y_true, y_pred_proba):
precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
ap_score = average_precision_score(y_true, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'PR Curve (AP = {ap_score:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True)
plt.show()
return ap_score
# When to use:
# ✅ Imbalanced datasets
# ✅ When positive class is more important
# ✅ Information retrieval tasks

Log Loss = -Σ(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

from sklearn.metrics import log_loss
def calculate_log_loss(y_true, y_pred_proba):
return log_loss(y_true, y_pred_proba)
# When to use:
# ✅ When you have probability predictions
# ✅ Calibrated probability estimates are important
# ✅ Multi-class classification

Cohen's Kappa = (p_o - p_e) / (1 - p_e)

from sklearn.metrics import cohen_kappa_score
def calculate_kappa(y_true, y_pred):
return cohen_kappa_score(y_true, y_pred)
# Interpretation:
# < 0: Less than chance agreement
# 0.01-0.20: Slight agreement
# 0.21-0.40: Fair agreement
# 0.41-0.60: Moderate agreement
# 0.61-0.80: Substantial agreement
# 0.81-1.00: Almost perfect agreement

MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

from sklearn.metrics import matthews_corrcoef
def calculate_mcc(y_true, y_pred):
return matthews_corrcoef(y_true, y_pred)
# When to use:
# ✅ Imbalanced datasets
# ✅ Binary classification
# ✅ When all confusion matrix elements are important

from sklearn.metrics import classification_report
def comprehensive_classification_report(y_true, y_pred, target_names=None):
"""
Generate comprehensive classification report
"""
report = classification_report(y_true, y_pred, target_names=target_names, output_dict=True)
print("="*60)
print("COMPREHENSIVE CLASSIFICATION REPORT")
print("="*60)
# Overall metrics
print(f"Accuracy: {report['accuracy']:.4f}")
print(f"Macro Average F1: {report['macro avg']['f1-score']:.4f}")
print(f"Weighted Average F1: {report['weighted avg']['f1-score']:.4f}")
# Per-class metrics
print("\nPer-Class Metrics:")
print("-" * 50)
print(f"{'Class':<15} {'Precision':<10} {'Recall':<10} {'F1-Score':<10} {'Support':<10}")
print("-" * 50)
for class_name, metrics in report.items():
if class_name not in ['accuracy', 'macro avg', 'weighted avg']:
print(f"{class_name:<15} {metrics['precision']:<10.4f} "
f"{metrics['recall']:<10.4f} {metrics['f1-score']:<10.4f} "
f"{int(metrics['support']):<10}")
return report
# Averaging strategies for multi-class:
# - 'micro': Calculate globally (good for imbalanced datasets)
# - 'macro': Calculate per class then average (treats all classes equally)
# - 'weighted': Calculate per class, weighted by support

Regression metrics evaluate how well a model predicts continuous numerical values. The choice depends on the scale of your target variable, the presence of outliers, and interpretability requirements.
MAE = Σ|y_true - y_pred| / n

from sklearn.metrics import mean_absolute_error
def calculate_mae(y_true, y_pred):
return mean_absolute_error(y_true, y_pred)
# When to use:
# ✅ Robust to outliers
# ✅ Interpretable (same units as target)
# ✅ When all errors are equally important

MSE = Σ(y_true - y_pred)² / n

from sklearn.metrics import mean_squared_error
def calculate_mse(y_true, y_pred):
return mean_squared_error(y_true, y_pred)
# When to use:
# ✅ Penalizes large errors more heavily
# ✅ Differentiable (good for optimization)
# Caveats:
# ⚠️ Sensitive to outliers
# ⚠️ Units are squared (less interpretable)

RMSE = √(Σ(y_true - y_pred)² / n)

import numpy as np
def calculate_rmse(y_true, y_pred):
return np.sqrt(mean_squared_error(y_true, y_pred))
# When to use:
# ✅ Most common regression metric
# ✅ Interpretable units
# ✅ Penalizes large errors
# ⚠️ Sensitive to outliers

R² = 1 - (SS_res / SS_tot)

from sklearn.metrics import r2_score
def calculate_r2(y_true, y_pred):
return r2_score(y_true, y_pred)
# Interpretation:
# R² = 1: Perfect predictions
# R² = 0: Model performs as well as mean baseline
# R² < 0: Model performs worse than mean baseline
# When to use:
# ✅ Model comparison
# ✅ Proportion of variance explained
# Caveats:
# ⚠️ Doesn't indicate absolute prediction quality
# ⚠️ Can be misleading with non-linear relationships

Adjusted R² = 1 - ((1 - R²)(n - 1) / (n - k - 1))

def calculate_adjusted_r2(y_true, y_pred, n_features):
r2 = r2_score(y_true, y_pred)
n = len(y_true)
adjusted_r2 = 1 - ((1 - r2) * (n - 1)) / (n - n_features - 1)
return adjusted_r2
# When to use:
# ✅ Comparing models with different numbers of features
# ✅ Prevents overfitting due to too many features

MAPE = (100/n) * Σ|(y_true - y_pred) / y_true|

def calculate_mape(y_true, y_pred, epsilon=1e-8):
# Add epsilon to avoid division by zero
return np.mean(np.abs((y_true - y_pred) / (y_true + epsilon))) * 100
# When to use:
# ✅ Scale-independent comparison
# ✅ Easy to interpret (percentage)
# Caveats:
# ⚠️ Problematic when y_true contains zeros or small values
# ⚠️ Asymmetric (over-prediction is penalized less)

SMAPE = (100/n) * Σ(|y_pred - y_true| / ((|y_true| + |y_pred|)/2))

def calculate_smape(y_true, y_pred):
denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
return np.mean(np.abs(y_pred - y_true) / denominator) * 100
# When to use:
# ✅ Symmetric penalty for over and under-prediction
# ✅ Scale-independent
# ✅ Bounded between 0 and 200%

MASE = MAE / MAE_naive

def calculate_mase(y_true, y_pred, y_train):
# Calculate naive forecast error (seasonal naive for time series)
mae_model = mean_absolute_error(y_true, y_pred)
mae_naive = mean_absolute_error(y_train[1:], y_train[:-1]) # Simple naive
return mae_model / mae_naive
# Interpretation:
# MASE < 1: Better than naive forecast
# MASE = 1: Same as naive forecast
# MASE > 1: Worse than naive forecast

from sklearn.metrics import mean_squared_error
def calculate_huber_loss(y_true, y_pred, delta=1.0):
residual = np.abs(y_true - y_pred)
condition = residual <= delta
squared_loss = 0.5 * (residual ** 2)
linear_loss = delta * residual - 0.5 * (delta ** 2)
return np.mean(np.where(condition, squared_loss, linear_loss))
# When to use:
# ✅ Robust to outliers
# ✅ Differentiable everywhere
# ✅ Good balance between MSE and MAE

import pandas as pd
import matplotlib.pyplot as plt
def comprehensive_regression_report(y_true, y_pred, model_name="Model"):
"""
Generate comprehensive regression evaluation report
"""
print("="*60)
print(f"COMPREHENSIVE REGRESSION REPORT - {model_name}")
print("="*60)
# Calculate all metrics
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = calculate_mape(y_true, y_pred)
# Display metrics
metrics = {
'MAE': mae,
'MSE': mse,
'RMSE': rmse,
'R²': r2,
'MAPE (%)': mape
}
for metric, value in metrics.items():
print(f"{metric:<15}: {value:.6f}")
# Residual analysis
residuals = y_true - y_pred
print(f"\nResidual Analysis:")
print(f"Mean Residual : {np.mean(residuals):.6f}")
print(f"Std Residual : {np.std(residuals):.6f}")
print(f"Min Residual : {np.min(residuals):.6f}")
print(f"Max Residual : {np.max(residuals):.6f}")
# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Actual vs Predicted
axes[0, 0].scatter(y_true, y_pred, alpha=0.6)
min_val, max_val = min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())
axes[0, 0].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title('Actual vs Predicted')
axes[0, 0].grid(True, alpha=0.3)
# Residuals vs Predicted
axes[0, 1].scatter(y_pred, residuals, alpha=0.6)
axes[0, 1].axhline(y=0, color='r', linestyle='--')
axes[0, 1].set_xlabel('Predicted Values')
axes[0, 1].set_ylabel('Residuals')
axes[0, 1].set_title('Residuals vs Predicted')
axes[0, 1].grid(True, alpha=0.3)
# Residual Distribution
axes[1, 0].hist(residuals, bins=30, alpha=0.7, edgecolor='black')
axes[1, 0].axvline(x=0, color='r', linestyle='--')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Residual Distribution')
# Q-Q Plot for residual normality
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Residual Normality)')
plt.tight_layout()
plt.show()
    return metrics

Clustering evaluation is more challenging than supervised learning because there is no ground truth. We use both internal measures (based on the data itself) and external measures (when true labels are available).
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
def calculate_silhouette_analysis(X, cluster_labels, n_clusters):
"""
Comprehensive silhouette analysis
"""
# Overall silhouette score
avg_silhouette = silhouette_score(X, cluster_labels)
# Per-sample silhouette scores
sample_silhouette_values = silhouette_samples(X, cluster_labels)
# Create silhouette plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Silhouette plot
y_lower = 10
for i in range(n_clusters):
cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
cluster_silhouette_values.sort()
size_cluster_i = cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = plt.cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10
ax1.axvline(x=avg_silhouette, color="red", linestyle="--")
ax1.set_xlabel('Silhouette coefficient values')
ax1.set_ylabel('Cluster label')
ax1.set_title('Silhouette Plot')
# Scatter plot of clusters
colors = plt.cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors)
ax2.set_title('Clustered Data')
plt.show()
print(f"Average Silhouette Score: {avg_silhouette:.4f}")
# Interpretation:
if avg_silhouette > 0.7:
print("Strong cluster structure")
elif avg_silhouette > 0.5:
print("Reasonable cluster structure")
elif avg_silhouette > 0.25:
print("Weak cluster structure")
else:
print("No substantial cluster structure")
return avg_silhouette
# When to use:
# ✅ Comparing different clustering algorithms
# ✅ Determining optimal number of clusters
# ✅ Dense, well-separated clusters
# When to be careful:
# ⚠️ Clusters with very different densities or sizes

from sklearn.metrics import calinski_harabasz_score
def calculate_calinski_harabasz(X, cluster_labels):
score = calinski_harabasz_score(X, cluster_labels)
print(f"Calinski-Harabasz Score: {score:.4f}")
return score
# When to use:
# ✅ Convex clusters
# ✅ Similar cluster sizes
# ✅ Fast computation

from sklearn.metrics import davies_bouldin_score
def calculate_davies_bouldin(X, cluster_labels):
score = davies_bouldin_score(X, cluster_labels)
print(f"Davies-Bouldin Score: {score:.4f}")
return score
# When to use:
# ✅ Convex clusters
# ✅ Similar cluster sizes
# Note: lower values indicate better clustering

def calculate_inertia(X, cluster_labels, centroids):
"""
Calculate within-cluster sum of squares
"""
inertia = 0
for i in range(len(centroids)):
cluster_points = X[cluster_labels == i]
if len(cluster_points) > 0:
inertia += np.sum((cluster_points - centroids[i]) ** 2)
return inertia
def elbow_method_analysis(X, max_k=10):
"""
Perform elbow method analysis for optimal k
"""
from sklearn.cluster import KMeans
inertias = []
k_range = range(1, max_k + 1)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()
    return k_range, inertias

These metrics are used when true cluster labels are available (for validation purposes).
from sklearn.metrics import adjusted_rand_score
def calculate_ari(true_labels, pred_labels):
ari = adjusted_rand_score(true_labels, pred_labels)
print(f"Adjusted Rand Index: {ari:.4f}")
# Interpretation
if ari > 0.9:
print("Excellent clustering")
elif ari > 0.7:
print("Good clustering")
elif ari > 0.5:
print("Moderate clustering")
else:
print("Poor clustering")
    return ari

from sklearn.metrics import normalized_mutual_info_score
def calculate_nmi(true_labels, pred_labels):
nmi = normalized_mutual_info_score(true_labels, pred_labels)
print(f"Normalized Mutual Information: {nmi:.4f}")
    return nmi

from sklearn.metrics import homogeneity_completeness_v_measure
def calculate_clustering_metrics(true_labels, pred_labels):
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(
true_labels, pred_labels
)
print(f"Homogeneity: {homogeneity:.4f}")
print(f"Completeness: {completeness:.4f}")
print(f"V-measure: {v_measure:.4f}")
    return homogeneity, completeness, v_measure

def comprehensive_clustering_evaluation(X, cluster_labels, true_labels=None,
centroids=None, n_clusters=None):
"""
Comprehensive clustering evaluation with all relevant metrics
"""
print("="*60)
print("COMPREHENSIVE CLUSTERING EVALUATION")
print("="*60)
# Internal metrics
print("Internal Metrics (no ground truth needed):")
print("-" * 40)
silhouette = silhouette_score(X, cluster_labels)
calinski = calinski_harabasz_score(X, cluster_labels)
davies = davies_bouldin_score(X, cluster_labels)
print(f"Silhouette Score : {silhouette:.4f}")
print(f"Calinski-Harabasz : {calinski:.4f}")
print(f"Davies-Bouldin : {davies:.4f}")
if centroids is not None:
inertia = calculate_inertia(X, cluster_labels, centroids)
print(f"Inertia (WCSS) : {inertia:.4f}")
# External metrics (if ground truth available)
if true_labels is not None:
print(f"\nExternal Metrics (with ground truth):")
print("-" * 40)
ari = adjusted_rand_score(true_labels, cluster_labels)
nmi = normalized_mutual_info_score(true_labels, cluster_labels)
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(
true_labels, cluster_labels
)
print(f"Adjusted Rand Index : {ari:.4f}")
print(f"Normalized Mutual Info: {nmi:.4f}")
print(f"Homogeneity : {homogeneity:.4f}")
print(f"Completeness : {completeness:.4f}")
print(f"V-measure : {v_measure:.4f}")
# Cluster statistics
unique_labels, counts = np.unique(cluster_labels, return_counts=True)
print(f"\nCluster Statistics:")
print("-" * 20)
print(f"Number of clusters : {len(unique_labels)}")
print(f"Cluster sizes : {dict(zip(unique_labels, counts))}")
print(f"Largest cluster : {counts.max()} samples")
print(f"Smallest cluster : {counts.min()} samples")
print(f"Avg cluster size : {counts.mean():.1f} samples")
return {
'silhouette': silhouette,
'calinski_harabasz': calinski,
'davies_bouldin': davies,
'ari': ari if true_labels is not None else None,
'nmi': nmi if true_labels is not None else None,
'v_measure': v_measure if true_labels is not None else None
    }

| Scenario | Recommended Metrics | Reasoning |
|---|---|---|
| Balanced Dataset | Accuracy, F1-Score | All classes equally important |
| Imbalanced Dataset | Precision, Recall, F1, AUC-PR | Focus on minority class performance |
| Medical Diagnosis | Recall (Sensitivity), Specificity | Minimize false negatives and false positives |
| Spam Detection | Precision, Specificity | Minimize false positives (good emails marked as spam) |
| Fraud Detection | Recall, F1-Score | Don’t miss fraudulent transactions |
| Information Retrieval | Precision@K, Recall@K, MAP | Relevant results at top positions |
| Multi-class Problems | Macro/Micro F1, Cohen’s Kappa | Handle class imbalances appropriately |
| Probability Calibration | Log Loss, Brier Score | Well-calibrated probability estimates |
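
For the imbalanced-dataset row above, it helps to compute the recommended metrics together rather than one at a time. A minimal sketch, assuming `y_scores` are predicted positive-class probabilities; the helper name and the toy arrays are illustrative:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             average_precision_score, roc_auc_score)

def imbalanced_classification_report(y_true, y_scores, threshold=0.5):
    """Metric bundle suited to imbalanced problems (see the table above)."""
    y_pred = (np.asarray(y_scores) >= threshold).astype(int)
    return {
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_true, y_scores),
        'pr_auc': average_precision_score(y_true, y_scores),  # AUC-PR
    }

# Toy example: 10% positive class, two false positives at the default threshold
y_true = np.array([0]*18 + [1]*2)
y_scores = np.array([0.1]*16 + [0.6, 0.7] + [0.8, 0.9])
print(imbalanced_classification_report(y_true, y_scores))
```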
| Scenario | Recommended Metrics | Reasoning |
|---|---|---|
| General Regression | RMSE, MAE, R² | Standard metrics for most cases |
| Outliers Present | MAE, Huber Loss | Robust to extreme values |
| Different Scales | MAPE, SMAPE | Scale-independent comparison |
| Time Series | MASE, sMAPE | Account for seasonal patterns |
| Model Comparison | Adjusted R², AIC, BIC | Penalize model complexity |
| Business Impact | Domain-specific metrics | Custom metrics aligned with business goals |
| Interpretability | MAE, MAPE | Easy to explain to stakeholders |
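
The model-comparison row mentions AIC and BIC; scikit-learn does not compute them for arbitrary regressors, but under a Gaussian error assumption they can be approximated from the residual sum of squares (with constant terms dropped). A hedged sketch:

```python
import numpy as np

def aic_bic(y_true, y_pred, n_params):
    """Approximate AIC/BIC under a Gaussian error assumption (constants dropped)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    sse = np.sum((y_true - y_pred) ** 2)
    aic = n * np.log(sse / n) + 2 * n_params
    bic = n * np.log(sse / n) + n_params * np.log(n)
    return aic, bic

# Lower is better; a model with more parameters must reduce SSE enough to win
aic, bic = aic_bic([3.0, 5.0, 7.0, 9.0], [2.8, 5.2, 6.9, 9.3], n_params=2)
print(f"AIC: {aic:.2f}, BIC: {bic:.2f}")
```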
| Scenario | Recommended Metrics | Reasoning |
|---|---|---|
| No Ground Truth | Silhouette, Calinski-Harabasz | Internal validation only |
| Known True Clusters | ARI, NMI, V-measure | External validation available |
| Spherical Clusters | Inertia, K-means metrics | Assumes convex cluster shapes |
| Arbitrary Shapes | Silhouette, DBSCAN metrics | Handle non-convex clusters |
| Different Sizes | Silhouette analysis | Individual cluster quality |
| Hierarchical Clustering | Cophenetic correlation | Preserve hierarchical structure |
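
The hierarchical-clustering row references the cophenetic correlation, which SciPy provides via `scipy.cluster.hierarchy.cophenet`. A minimal sketch on synthetic two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic 2-D data with two loose groups
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Build the hierarchy, then compare cophenetic distances with the original distances
Z = linkage(X, method='average')
coph_corr, coph_dists = cophenet(Z, pdist(X))
print(f"Cophenetic correlation: {coph_corr:.4f}")  # closer to 1 = hierarchy preserves structure
```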
def cost_sensitive_evaluation(y_true, y_pred, cost_matrix):
"""
Evaluate model performance considering business costs
cost_matrix: [[TN_cost, FP_cost],
[FN_cost, TP_cost]]
"""
cm = confusion_matrix(y_true, y_pred)
total_cost = np.sum(cm * cost_matrix)
print(f"Confusion Matrix:")
print(cm)
print(f"Cost Matrix:")
print(cost_matrix)
print(f"Total Cost: {total_cost}")
return total_cost
# Example: Medical diagnosis where false negatives are 10x more costly
medical_cost_matrix = np.array([[0, 1], # TN=0, FP=1
[10, 0]]) # FN=10, TP=0
# Example: Marketing where false positives waste money
marketing_cost_matrix = np.array([[0, 5], # TN=0, FP=5
                                   [1, -2]])  # FN=1, TP=-2 (profit)

def custom_metric_example(y_true, y_pred):
"""
Example: Revenue-based metric for recommendation systems
"""
# Assume higher predicted values lead to higher revenue
revenue_per_unit = 10
cost_per_prediction = 0.1
# Calculate revenue from correct high-value predictions
high_value_threshold = 0.7
high_value_correct = ((y_pred > high_value_threshold) & (y_true > high_value_threshold)).sum()
total_revenue = high_value_correct * revenue_per_unit
total_cost = len(y_pred) * cost_per_prediction
net_profit = total_revenue - total_cost
return {
'total_revenue': total_revenue,
'total_cost': total_cost,
'net_profit': net_profit,
'roi': net_profit / total_cost if total_cost > 0 else 0
    }

def demonstrate_accuracy_paradox():
"""
Show how accuracy can be misleading with imbalanced data
"""
# Highly imbalanced dataset (1% positive class)
y_true = np.concatenate([np.ones(10), np.zeros(990)])
# Naive classifier that always predicts negative
y_pred_naive = np.zeros(1000)
# Slightly better classifier
y_pred_better = np.concatenate([np.ones(8), np.zeros(992)])
print("Accuracy Paradox Demonstration:")
print(f"Naive classifier accuracy: {accuracy_score(y_true, y_pred_naive):.3f}")
print(f"Better classifier accuracy: {accuracy_score(y_true, y_pred_better):.3f}")
print(f"Naive classifier F1: {f1_score(y_true, y_pred_naive):.3f}")
print(f"Better classifier F1: {f1_score(y_true, y_pred_better):.3f}")
    # The naive classifier has high accuracy but zero F1-score!

def demonstrate_simpsons_paradox():
"""
Show how aggregate metrics can be misleading
"""
# Two groups with different characteristics
group1_true = np.array([1, 1, 1, 0, 0])
group1_pred = np.array([1, 1, 0, 0, 0])
group2_true = np.array([1, 0, 0, 0, 0])
group2_pred = np.array([1, 0, 0, 0, 0])
# Individual group performance
g1_acc = accuracy_score(group1_true, group1_pred)
g2_acc = accuracy_score(group2_true, group2_pred)
# Combined performance
combined_true = np.concatenate([group1_true, group2_true])
combined_pred = np.concatenate([group1_pred, group2_pred])
combined_acc = accuracy_score(combined_true, combined_pred)
print("Simpson's Paradox in Metrics:")
print(f"Group 1 accuracy: {g1_acc:.3f}")
print(f"Group 2 accuracy: {g2_acc:.3f}")
print(f"Combined accuracy: {combined_acc:.3f}")
print(f"Average of group accuracies: {(g1_acc + g2_acc)/2:.3f}")def robust_model_evaluation(models, X_test, y_test, cv_folds=5):
"""
    Robust evaluation using multiple metrics and statistical resampling.
    Assumes each model in `models` is already fitted (for the bootstrap step).
"""
from scipy import stats
results = {}
for model_name, model in models.items():
# Cross-validation scores
cv_scores = cross_val_score(model, X_test, y_test, cv=cv_folds)
# Bootstrap confidence intervals
bootstrap_scores = []
for _ in range(100):
indices = np.random.choice(len(y_test), len(y_test), replace=True)
X_boot, y_boot = X_test[indices], y_test[indices]
y_pred_boot = model.predict(X_boot)
bootstrap_scores.append(accuracy_score(y_boot, y_pred_boot))
results[model_name] = {
'cv_mean': np.mean(cv_scores),
'cv_std': np.std(cv_scores),
'bootstrap_ci': np.percentile(bootstrap_scores, [2.5, 97.5])
}
    return results

This comprehensive metrics reference provides you with the tools to properly evaluate machine learning models across all domains. Remember that the choice of metrics should always align with your specific problem context, business objectives, and the characteristics of your data.
Key Takeaway: There is no single "best" metric. The art of machine learning evaluation lies in selecting the right combination of metrics that tell the complete story of your model's performance.

# Appendix E: Industry Applications
This appendix provides comprehensive coverage of machine learning applications across various industries, including real-world case studies, implementation strategies, and industry-specific considerations.
Machine learning has revolutionized medical imaging, enabling automated detection and diagnosis of various conditions.
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
class MedicalImageClassifier:
"""
A CNN-based classifier for medical image analysis
"""
def __init__(self, input_shape=(224, 224, 3), num_classes=2):
self.input_shape = input_shape
self.num_classes = num_classes
self.model = self._build_model()
def _build_model(self):
"""Build CNN architecture for medical image classification"""
model = models.Sequential([
# Feature extraction layers
layers.Conv2D(32, (3, 3), activation='relu', input_shape=self.input_shape),
layers.BatchNormalization(),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.BatchNormalization(),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(128, (3, 3), activation='relu'),
layers.BatchNormalization(),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(256, (3, 3), activation='relu'),
layers.BatchNormalization(),
layers.GlobalAveragePooling2D(),
# Classification layers
layers.Dense(512, activation='relu'),
layers.Dropout(0.5),
layers.Dense(self.num_classes, activation='softmax')
])
model.compile(
optimizer='adam',
loss='categorical_crossentropy',
            metrics=['accuracy',
                     tf.keras.metrics.Precision(name='precision'),
                     tf.keras.metrics.Recall(name='recall')]
)
return model
def train(self, train_data, validation_data, epochs=50):
"""Train the model with medical imaging data"""
callbacks = [
tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5),
tf.keras.callbacks.ModelCheckpoint('best_medical_model.h5', save_best_only=True)
]
history = self.model.fit(
train_data,
validation_data=validation_data,
epochs=epochs,
callbacks=callbacks
)
return history
def evaluate_clinical_metrics(self, test_data, test_labels):
"""Evaluate model with clinical metrics"""
predictions = self.model.predict(test_data)
y_pred = np.argmax(predictions, axis=1)
y_true = np.argmax(test_labels, axis=1)
# Clinical evaluation metrics
report = classification_report(y_true, y_pred,
target_names=['Normal', 'Abnormal'])
cm = confusion_matrix(y_true, y_pred)
# Calculate sensitivity and specificity
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn) # True Positive Rate
specificity = tn / (tn + fp) # True Negative Rate
return {
'classification_report': report,
'confusion_matrix': cm,
'sensitivity': sensitivity,
'specificity': specificity,
'predictions': predictions
}
# Example usage for chest X-ray classification
def chest_xray_classification_pipeline():
"""Complete pipeline for chest X-ray classification"""
# Data preprocessing for medical images
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Medical image augmentation (conservative for clinical data)
train_datagen = ImageDataGenerator(
rescale=1./255,
rotation_range=10, # Conservative rotation for medical images
width_shift_range=0.1,
height_shift_range=0.1,
horizontal_flip=False, # Usually not appropriate for medical images
zoom_range=0.1,
validation_split=0.2
)
# Load and prepare data
train_generator = train_datagen.flow_from_directory(
'chest_xray_data/train',
target_size=(224, 224),
batch_size=32,
class_mode='categorical',
subset='training'
)
validation_generator = train_datagen.flow_from_directory(
'chest_xray_data/train',
target_size=(224, 224),
batch_size=32,
class_mode='categorical',
subset='validation'
)
# Initialize and train classifier
classifier = MedicalImageClassifier()
history = classifier.train(train_generator, validation_generator)
return classifier, history
# Drug Discovery and Development
class DrugDiscoveryML:
"""Machine learning models for drug discovery applications"""
def __init__(self):
self.molecular_model = None
self.toxicity_model = None
def molecular_property_prediction(self, smiles_data, properties):
"""Predict molecular properties from SMILES strings"""
from rdkit import Chem
from rdkit.Chem import Descriptors
import pandas as pd
# Extract molecular features
features = []
for smiles in smiles_data:
mol = Chem.MolFromSmiles(smiles)
if mol:
desc = {
'MW': Descriptors.MolWt(mol),
'LogP': Descriptors.MolLogP(mol),
'HBA': Descriptors.NumHAcceptors(mol),
'HBD': Descriptors.NumHDonors(mol),
'TPSA': Descriptors.TPSA(mol),
'RotBonds': Descriptors.NumRotatableBonds(mol)
}
features.append(desc)
features_df = pd.DataFrame(features)
# Train regression model for property prediction
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
self.molecular_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Cross-validation for model evaluation
cv_scores = cross_val_score(self.molecular_model, features_df, properties,
cv=5, scoring='r2')
self.molecular_model.fit(features_df, properties)
return cv_scores.mean(), features_df
def toxicity_screening(self, molecular_features, toxicity_labels):
"""Screen compounds for potential toxicity"""
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve
self.toxicity_model = GradientBoostingClassifier(
n_estimators=200,
learning_rate=0.1,
max_depth=6,
random_state=42
)
self.toxicity_model.fit(molecular_features, toxicity_labels)
# Calculate AUC for toxicity prediction
predictions = self.toxicity_model.predict_proba(molecular_features)[:, 1]
auc_score = roc_auc_score(toxicity_labels, predictions)
        return auc_score, predictions

class EHRAnalytics:
"""Analytics pipeline for Electronic Health Records"""
def __init__(self):
self.risk_model = None
self.readmission_model = None
def preprocess_ehr_data(self, ehr_df):
"""Preprocess EHR data for machine learning"""
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Handle missing values
        ehr_df = ehr_df.ffill()  # Forward fill for time-ordered records
ehr_df = ehr_df.fillna(0) # Fill remaining with 0
# Encode categorical variables
categorical_cols = ehr_df.select_dtypes(include=['object']).columns
le = LabelEncoder()
for col in categorical_cols:
ehr_df[col] = le.fit_transform(ehr_df[col].astype(str))
# Feature engineering
ehr_df['age_group'] = pd.cut(ehr_df['age'], bins=[0, 30, 50, 70, 100],
labels=['young', 'adult', 'senior', 'elderly'])
ehr_df['bmi_category'] = pd.cut(ehr_df['bmi'],
bins=[0, 18.5, 25, 30, 100],
labels=['underweight', 'normal', 'overweight', 'obese'])
return ehr_df
def build_risk_stratification_model(self, features, outcomes):
"""Build model for patient risk stratification"""
        from xgboost import XGBClassifier  # XGBoost ships separately from scikit-learn
from sklearn.model_selection import GridSearchCV
# Hyperparameter tuning
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2]
}
self.risk_model = XGBClassifier(random_state=42)
grid_search = GridSearchCV(
self.risk_model, param_grid,
cv=5, scoring='roc_auc', n_jobs=-1
)
grid_search.fit(features, outcomes)
self.risk_model = grid_search.best_estimator_
return grid_search.best_score_
def predict_readmission_risk(self, patient_data):
"""Predict 30-day readmission risk"""
from sklearn.ensemble import RandomForestClassifier
# Features specific to readmission prediction
readmission_features = [
'length_of_stay', 'num_diagnoses', 'num_procedures',
'num_medications', 'discharge_disposition', 'admission_source'
]
if self.readmission_model is None:
self.readmission_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
)
# Return risk probability
risk_prob = self.readmission_model.predict_proba(patient_data)[:, 1]
        return risk_prob

The financial industry leverages ML extensively for detecting fraudulent transactions and preventing financial crimes.
class FraudDetectionSystem:
"""Comprehensive fraud detection system for financial transactions"""
def __init__(self):
self.anomaly_detector = None
self.fraud_classifier = None
self.feature_scaler = None
def engineer_transaction_features(self, transactions_df):
"""Create features for fraud detection"""
import pandas as pd
import numpy as np
# Time-based features
transactions_df['hour'] = pd.to_datetime(transactions_df['timestamp']).dt.hour
transactions_df['day_of_week'] = pd.to_datetime(transactions_df['timestamp']).dt.dayofweek
transactions_df['is_weekend'] = transactions_df['day_of_week'].isin([5, 6])
# Amount-based features
transactions_df['amount_log'] = np.log1p(transactions_df['amount'])
transactions_df['amount_zscore'] = (transactions_df['amount'] -
transactions_df['amount'].mean()) / transactions_df['amount'].std()
# Customer behavior features
customer_stats = transactions_df.groupby('customer_id').agg({
'amount': ['mean', 'std', 'count'],
'merchant_category': lambda x: x.nunique()
}).reset_index()
customer_stats.columns = ['customer_id', 'avg_amount', 'std_amount',
'transaction_count', 'unique_merchants']
transactions_df = transactions_df.merge(customer_stats, on='customer_id')
# Deviation from normal behavior
transactions_df['amount_deviation'] = (transactions_df['amount'] -
transactions_df['avg_amount']) / transactions_df['std_amount']
# Velocity features (transactions in last hour/day)
        # Time-based rolling windows require a datetime index
        transactions_df['timestamp'] = pd.to_datetime(transactions_df['timestamp'])
        transactions_df = transactions_df.sort_values('timestamp')
        transactions_df['transactions_last_hour'] = (
            transactions_df.set_index('timestamp')
            .groupby('customer_id')['amount']
            .transform(lambda s: s.rolling('1h').count())
            .to_numpy()
        )
return transactions_df
def build_anomaly_detector(self, normal_transactions):
"""Build unsupervised anomaly detection model"""
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Feature scaling
self.feature_scaler = StandardScaler()
scaled_features = self.feature_scaler.fit_transform(normal_transactions)
# Isolation Forest for anomaly detection
self.anomaly_detector = IsolationForest(
contamination=0.1, # Expected fraud rate
random_state=42,
n_estimators=100
)
self.anomaly_detector.fit(scaled_features)
return self.anomaly_detector
def build_fraud_classifier(self, features, labels):
"""Build supervised fraud classification model"""
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
# Handle class imbalance with SMOTE
smote = SMOTE(random_state=42)
features_balanced, labels_balanced = smote.fit_resample(features, labels)
# Gradient Boosting Classifier
self.fraud_classifier = GradientBoostingClassifier(
n_estimators=200,
learning_rate=0.1,
max_depth=6,
random_state=42
)
# Cross-validation
cv_scores = cross_val_score(
self.fraud_classifier, features_balanced, labels_balanced,
cv=5, scoring='f1'
)
self.fraud_classifier.fit(features_balanced, labels_balanced)
return cv_scores.mean()
def real_time_fraud_scoring(self, transaction):
"""Real-time fraud scoring for incoming transactions"""
import numpy as np
# Engineer features for single transaction
transaction_features = self.engineer_transaction_features(transaction)
scaled_features = self.feature_scaler.transform(transaction_features)
# Anomaly score
anomaly_score = self.anomaly_detector.decision_function(scaled_features)[0]
# Fraud probability
fraud_probability = self.fraud_classifier.predict_proba(scaled_features)[0, 1]
# Combined risk score
risk_score = 0.3 * (1 - (anomaly_score + 1) / 2) + 0.7 * fraud_probability
# Risk level determination
if risk_score > 0.8:
risk_level = "HIGH"
elif risk_score > 0.5:
risk_level = "MEDIUM"
else:
risk_level = "LOW"
return {
'risk_score': risk_score,
'risk_level': risk_level,
'anomaly_score': anomaly_score,
'fraud_probability': fraud_probability
}
# Credit Risk Assessment
class CreditRiskModel:
"""Credit risk assessment and loan default prediction"""
def __init__(self):
self.credit_model = None
self.feature_importance = None
def prepare_credit_features(self, applicant_data):
"""Prepare features for credit risk assessment"""
import pandas as pd
import numpy as np
# Financial ratios
applicant_data['debt_to_income'] = applicant_data['total_debt'] / applicant_data['annual_income']
applicant_data['credit_utilization'] = applicant_data['credit_used'] / applicant_data['credit_limit']
applicant_data['payment_to_income'] = applicant_data['monthly_payment'] / (applicant_data['annual_income'] / 12)
# Credit history features
applicant_data['credit_history_years'] = applicant_data['oldest_account_age'] / 12
applicant_data['avg_account_age'] = applicant_data['total_account_age'] / applicant_data['num_accounts']
# Behavioral features
applicant_data['recent_inquiries_rate'] = applicant_data['inquiries_6m'] / 6
applicant_data['delinquency_rate'] = applicant_data['delinquencies'] / applicant_data['num_accounts']
return applicant_data
def build_credit_model(self, features, default_labels):
"""Build credit risk prediction model"""
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_curve, auc
        import matplotlib.pyplot as plt
        import pandas as pd
self.credit_model = RandomForestClassifier(
n_estimators=200,
max_depth=10,
min_samples_split=10,
random_state=42
)
self.credit_model.fit(features, default_labels)
# Feature importance analysis
self.feature_importance = pd.DataFrame({
'feature': features.columns,
'importance': self.credit_model.feature_importances_
}).sort_values('importance', ascending=False)
return self.feature_importance
def calculate_credit_score(self, applicant_features):
"""Calculate credit score based on risk probability"""
default_probability = self.credit_model.predict_proba(applicant_features)[:, 1]
# Convert probability to credit score (300-850 range)
credit_score = 850 - (default_probability * 550)
        return credit_score[0], default_probability[0]

class AlgorithmicTradingSystem:
"""Machine learning-based trading system"""
def __init__(self):
self.price_predictor = None
self.signal_generator = None
self.risk_manager = None
def technical_indicators(self, price_data):
"""Calculate technical indicators for trading signals"""
import pandas as pd
import numpy as np
# Moving averages
price_data['SMA_20'] = price_data['close'].rolling(window=20).mean()
price_data['SMA_50'] = price_data['close'].rolling(window=50).mean()
price_data['EMA_12'] = price_data['close'].ewm(span=12).mean()
price_data['EMA_26'] = price_data['close'].ewm(span=26).mean()
# MACD
price_data['MACD'] = price_data['EMA_12'] - price_data['EMA_26']
price_data['MACD_signal'] = price_data['MACD'].ewm(span=9).mean()
# RSI
delta = price_data['close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
rs = gain / loss
price_data['RSI'] = 100 - (100 / (1 + rs))
# Bollinger Bands
price_data['BB_middle'] = price_data['close'].rolling(window=20).mean()
bb_std = price_data['close'].rolling(window=20).std()
price_data['BB_upper'] = price_data['BB_middle'] + (bb_std * 2)
price_data['BB_lower'] = price_data['BB_middle'] - (bb_std * 2)
# Volume indicators
price_data['volume_sma'] = price_data['volume'].rolling(window=20).mean()
price_data['volume_ratio'] = price_data['volume'] / price_data['volume_sma']
return price_data
def build_price_predictor(self, market_data, target_returns):
"""Build LSTM model for price prediction"""
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.preprocessing import MinMaxScaler
# Prepare data for LSTM
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(market_data)
# Create sequences for LSTM
def create_sequences(data, seq_length=60):
X, y = [], []
for i in range(seq_length, len(data)):
X.append(data[i-seq_length:i])
y.append(data[i, 0]) # Predict close price
return np.array(X), np.array(y)
X, y = create_sequences(scaled_data)
# LSTM model architecture
model = models.Sequential([
layers.LSTM(50, return_sequences=True, input_shape=(X.shape[1], X.shape[2])),
layers.Dropout(0.2),
layers.LSTM(50, return_sequences=True),
layers.Dropout(0.2),
layers.LSTM(50),
layers.Dropout(0.2),
layers.Dense(25),
layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# Train the model
history = model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)
self.price_predictor = model
return history
def generate_trading_signals(self, market_features):
"""Generate buy/sell signals based on ML predictions"""
from sklearn.ensemble import RandomForestClassifier
        import numpy as np
        import pandas as pd
# Create target variable (1 for buy, 0 for sell, based on future returns)
future_returns = market_features['close'].pct_change().shift(-1)
signals = np.where(future_returns > 0.01, 1, 0) # Buy if > 1% gain expected
# Feature selection for signal generation
signal_features = [
'RSI', 'MACD', 'volume_ratio', 'SMA_20', 'SMA_50',
'BB_upper', 'BB_lower', 'EMA_12', 'EMA_26'
]
        X = market_features[signal_features].dropna()
        # Align labels with the rows that survive dropna()
        y = pd.Series(signals, index=market_features.index).loc[X.index]
# Random Forest for signal generation
self.signal_generator = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42
)
self.signal_generator.fit(X, y)
# Generate current signals
current_signals = self.signal_generator.predict_proba(X)[:, 1]
return current_signals
def portfolio_optimization(self, returns_data, risk_tolerance=0.1):
"""Optimize portfolio allocation using modern portfolio theory"""
import numpy as np
from scipy.optimize import minimize
# Calculate expected returns and covariance matrix
expected_returns = returns_data.mean() * 252 # Annualized
cov_matrix = returns_data.cov() * 252 # Annualized
num_assets = len(expected_returns)
# Objective function: maximize Sharpe ratio
def objective(weights):
portfolio_return = np.sum(expected_returns * weights)
portfolio_volatility = np.sqrt(np.dot(weights.T, np.dot(cov_matrix, weights)))
return -portfolio_return / portfolio_volatility # Negative for maximization
# Constraints
constraints = [
{'type': 'eq', 'fun': lambda x: np.sum(x) - 1}, # Weights sum to 1
{'type': 'ineq', 'fun': lambda x: risk_tolerance - np.sqrt(np.dot(x.T, np.dot(cov_matrix, x)))} # Risk constraint
]
# Bounds for weights (0 to 1 for each asset)
bounds = tuple((0, 1) for _ in range(num_assets))
# Initial guess (equal weights)
initial_guess = np.array([1/num_assets] * num_assets)
# Optimization
result = minimize(
objective,
initial_guess,
method='SLSQP',
bounds=bounds,
constraints=constraints
)
optimal_weights = result.x
        return optimal_weights, expected_returns, cov_matrix

Modern technology platforms rely heavily on ML-powered recommendation systems to enhance user experience and engagement.
class RecommendationSystem:
"""Comprehensive recommendation system with multiple approaches"""
def __init__(self):
self.collaborative_model = None
self.content_model = None
self.hybrid_model = None
def collaborative_filtering(self, user_item_matrix):
"""Collaborative filtering using matrix factorization"""
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error
import numpy as np
# Non-negative Matrix Factorization
self.collaborative_model = NMF(
n_components=50,
init='random',
random_state=42,
max_iter=200
)
# Fit the model
W = self.collaborative_model.fit_transform(user_item_matrix)
H = self.collaborative_model.components_
# Reconstruct the matrix
predicted_ratings = np.dot(W, H)
return predicted_ratings, W, H
def content_based_filtering(self, item_features, user_profiles):
"""Content-based recommendation using item features"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
# Create TF-IDF vectors for item descriptions
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
item_tfidf = tfidf.fit_transform(item_features['description'])
# Calculate item-item similarity
item_similarity = cosine_similarity(item_tfidf)
# Generate recommendations based on user history
def get_content_recommendations(user_id, user_history, top_k=10):
# Get items the user has interacted with
user_items = user_history[user_history['user_id'] == user_id]['item_id'].values
# Calculate average similarity scores for unseen items
recommendations = {}
for item_id in range(len(item_features)):
if item_id not in user_items:
sim_scores = item_similarity[user_items, item_id].mean()
recommendations[item_id] = sim_scores
# Sort and return top recommendations
sorted_recs = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
return [item_id for item_id, score in sorted_recs[:top_k]]
return get_content_recommendations, item_similarity
def deep_collaborative_filtering(self, user_item_interactions, embedding_dim=64):
"""Deep learning approach to collaborative filtering"""
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
# Get unique users and items
num_users = user_item_interactions['user_id'].nunique()
num_items = user_item_interactions['item_id'].nunique()
# Create user and item embeddings
user_input = layers.Input(shape=(), name='user_id')
item_input = layers.Input(shape=(), name='item_id')
user_embedding = layers.Embedding(num_users, embedding_dim, name='user_embedding')(user_input)
item_embedding = layers.Embedding(num_items, embedding_dim, name='item_embedding')(item_input)
user_vec = layers.Flatten(name='user_flatten')(user_embedding)
item_vec = layers.Flatten(name='item_flatten')(item_embedding)
# Neural collaborative filtering
concat = layers.Concatenate()([user_vec, item_vec])
dense1 = layers.Dense(128, activation='relu')(concat)
dropout1 = layers.Dropout(0.2)(dense1)
dense2 = layers.Dense(64, activation='relu')(dropout1)
dropout2 = layers.Dropout(0.2)(dense2)
output = layers.Dense(1, activation='sigmoid')(dropout2)
model = models.Model(inputs=[user_input, item_input], outputs=output)
model.compile(
optimizer=optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['mae']
)
self.collaborative_model = model
return model
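# Training sketch (illustrative; the DataFrame and its columns are assumed):
# the functional model above takes two integer-encoded inputs, so it can be
# trained with a list of input arrays, e.g.
#   model.fit([df['user_id'].values, df['item_id'].values],
#             df['interaction'].values, epochs=5, batch_size=256)
# where 'interaction' is a 0/1 implicit-feedback label.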
def hybrid_recommendation(self, user_id, collaborative_scores, content_scores, alpha=0.7):
"""Combine collaborative and content-based recommendations"""
# Weighted combination of scores
hybrid_scores = alpha * collaborative_scores + (1 - alpha) * content_scores
# Sort and return top recommendations
sorted_indices = np.argsort(hybrid_scores)[::-1]
return sorted_indices, hybrid_scores
def evaluate_recommendations(self, true_ratings, predicted_ratings):
"""Evaluate recommendation system performance"""
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
# Filter out missing values
mask = ~np.isnan(true_ratings) & ~np.isnan(predicted_ratings)
true_filtered = true_ratings[mask]
pred_filtered = predicted_ratings[mask]
# Calculate metrics
rmse = np.sqrt(mean_squared_error(true_filtered, pred_filtered))
mae = mean_absolute_error(true_filtered, pred_filtered)
# Precision and Recall at K
def precision_recall_at_k(true_ratings, predicted_ratings, k=10, threshold=4.0):
# Sort predictions
sorted_indices = np.argsort(predicted_ratings)[::-1][:k]
# Get top-k recommendations
top_k_true = true_ratings[sorted_indices]
# Calculate precision and recall
relevant_items = np.sum(top_k_true >= threshold)
total_relevant = np.sum(true_ratings >= threshold)
precision = relevant_items / k if k > 0 else 0
recall = relevant_items / total_relevant if total_relevant > 0 else 0
return precision, recall
precision_10, recall_10 = precision_recall_at_k(true_filtered, pred_filtered)
return {
'RMSE': rmse,
'MAE': mae,
'Precision@10': precision_10,
'Recall@10': recall_10
}
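As a quick usage sketch (toy data only; the rating matrix below is randomly generated for illustration), the NMF-based collaborative filter can be fit and scored like this:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 60-user x 80-item rating matrix with values 0-5 (0 = unrated)
ratings = rng.integers(0, 6, size=(60, 80)).astype(float)

rec = RecommendationSystem()
predicted, W, H = rec.collaborative_filtering(ratings)

# Score the reconstruction against the (toy) observed ratings
metrics = rec.evaluate_recommendations(ratings.flatten(), predicted.flatten())
print(metrics)
```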
# Search and Information Retrieval
class SearchSystem:
"""ML-powered search and information retrieval system"""
def __init__(self):
self.query_processor = None
self.document_embeddings = None
self.ranking_model = None
def semantic_search(self, documents, queries):
"""Semantic search using sentence embeddings"""
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Load pre-trained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode documents and queries
document_embeddings = model.encode(documents)
query_embeddings = model.encode(queries)
# Calculate similarities
similarities = cosine_similarity(query_embeddings, document_embeddings)
self.document_embeddings = document_embeddings
return similarities
def learning_to_rank(self, query_doc_features, relevance_scores):
"""Learning-to-rank model for search result ordering"""
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# Typical learning-to-rank features include:
# - Query-document similarity scores
# - Document popularity scores
# - Click-through rates
# - Document freshness
# - Query-document feature interactions
# scikit-learn has no dedicated ranking estimator, so we train a pointwise
# model that regresses graded relevance scores; dedicated pairwise/listwise
# rankers are available elsewhere (e.g., LightGBM's LGBMRanker or XGBoost's XGBRanker).
self.ranking_model = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=6,
random_state=42
)
self.ranking_model.fit(query_doc_features, relevance_scores)
return self.ranking_model
def query_expansion(self, original_query, word_embeddings):
"""Expand queries using word embeddings for better recall"""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Get embedding for original query terms
query_terms = original_query.lower().split()
query_embeddings = []
for term in query_terms:
if term in word_embeddings:
query_embeddings.append(word_embeddings[term])
if not query_embeddings:
return original_query
# Find similar terms
avg_embedding = np.mean(query_embeddings, axis=0)
# Calculate similarity with all terms in vocabulary
similarities = {}
for word, embedding in word_embeddings.items():
sim = cosine_similarity([avg_embedding], [embedding])[0][0]
similarities[word] = sim
# Get top similar terms
similar_terms = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:5]
expansion_terms = [term for term, score in similar_terms if score > 0.7]
# Create expanded query
expanded_query = original_query + " " + " ".join(expansion_terms)
return expanded_query
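A minimal usage sketch for the semantic search method above (the documents and query are invented, and the snippet assumes the sentence-transformers package is installed):

```python
search = SearchSystem()
docs = [
    "Gradient descent minimizes a loss function iteratively.",
    "Convolutional networks excel at image classification.",
    "Tokenization splits text into smaller units.",
]
queries = ["how do neural networks classify images"]

# similarities[i][j] = cosine similarity between query i and document j
similarities = search.semantic_search(docs, queries)
print(docs[similarities[0].argmax()])
```

class ComputerVisionSystem: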
"""Computer vision applications for various domains"""
def __init__(self):
self.object_detector = None
self.image_classifier = None
self.segmentation_model = None
def object_detection_yolo(self, image_path):
"""YOLO-based object detection"""
import cv2
import numpy as np
# Load pre-trained YOLO model
net = cv2.dnn.readNet('yolov4.weights', 'yolov4.cfg')
layer_names = net.getLayerNames()
output_layers = [layer_names[int(i) - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]
# Load and preprocess image
image = cv2.imread(image_path)
height, width, channels = image.shape
blob = cv2.dnn.blobFromImage(image, 0.00392, (416, 416), (0, 0, 0), True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)
# Extract detections
boxes = []
confidences = []
class_ids = []
for output in outputs:
for detection in output:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > 0.5:
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
# Non-maximum suppression
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
return boxes, confidences, class_ids, indices
def facial_recognition_system(self, face_images, face_labels):
"""Build facial recognition system"""
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
# Face detection using Haar cascades
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
# Extract face embeddings using pre-trained model
def extract_face_features(image):
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, 1.1, 4)
if len(faces) > 0:
(x, y, w, h) = faces[0] # Take the first detected face
face_roi = gray[y:y+h, x:x+w]
face_resized = cv2.resize(face_roi, (100, 100))
return face_resized.flatten()
return None
# Extract features from all face images
features = []
labels = []
for image, label in zip(face_images, face_labels):
feature = extract_face_features(image)
if feature is not None:
features.append(feature)
labels.append(label)
# Train SVM classifier
le = LabelEncoder()
encoded_labels = le.fit_transform(labels)
self.face_classifier = SVC(kernel='rbf', probability=True, random_state=42)
self.face_classifier.fit(features, encoded_labels)
self.label_encoder = le
return self.face_classifier
def image_segmentation(self, image_data, num_classes=21):
"""Semantic segmentation using U-Net architecture"""
import tensorflow as tf
from tensorflow.keras import layers, models
def unet_model(input_size=(256, 256, 3)):
inputs = layers.Input(input_size)
# Encoder
c1 = layers.Conv2D(64, 3, activation='relu', padding='same')(inputs)
c1 = layers.Conv2D(64, 3, activation='relu', padding='same')(c1)
p1 = layers.MaxPooling2D(2)(c1)
c2 = layers.Conv2D(128, 3, activation='relu', padding='same')(p1)
c2 = layers.Conv2D(128, 3, activation='relu', padding='same')(c2)
p2 = layers.MaxPooling2D(2)(c2)
c3 = layers.Conv2D(256, 3, activation='relu', padding='same')(p2)
c3 = layers.Conv2D(256, 3, activation='relu', padding='same')(c3)
p3 = layers.MaxPooling2D(2)(c3)
c4 = layers.Conv2D(512, 3, activation='relu', padding='same')(p3)
c4 = layers.Conv2D(512, 3, activation='relu', padding='same')(c4)
p4 = layers.MaxPooling2D(2)(c4)
# Bottleneck
c5 = layers.Conv2D(1024, 3, activation='relu', padding='same')(p4)
c5 = layers.Conv2D(1024, 3, activation='relu', padding='same')(c5)
# Decoder
u6 = layers.UpSampling2D(2)(c5)
u6 = layers.concatenate([u6, c4])
c6 = layers.Conv2D(512, 3, activation='relu', padding='same')(u6)
c6 = layers.Conv2D(512, 3, activation='relu', padding='same')(c6)
u7 = layers.UpSampling2D(2)(c6)
u7 = layers.concatenate([u7, c3])
c7 = layers.Conv2D(256, 3, activation='relu', padding='same')(u7)
c7 = layers.Conv2D(256, 3, activation='relu', padding='same')(c7)
u8 = layers.UpSampling2D(2)(c7)
u8 = layers.concatenate([u8, c2])
c8 = layers.Conv2D(128, 3, activation='relu', padding='same')(u8)
c8 = layers.Conv2D(128, 3, activation='relu', padding='same')(c8)
u9 = layers.UpSampling2D(2)(c8)
u9 = layers.concatenate([u9, c1])
c9 = layers.Conv2D(64, 3, activation='relu', padding='same')(u9)
c9 = layers.Conv2D(64, 3, activation='relu', padding='same')(c9)
outputs = layers.Conv2D(num_classes, 1, activation='softmax')(c9)
model = models.Model(inputs, outputs)
return model
self.segmentation_model = unet_model()
self.segmentation_model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
return self.segmentation_model

Machine learning enables proactive maintenance strategies by predicting equipment failures before they occur, reducing downtime and costs.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
class PredictiveMaintenanceSystem:
"""
A comprehensive predictive maintenance system for industrial equipment
"""
def __init__(self):
self.failure_model = RandomForestClassifier(n_estimators=100, random_state=42)
self.anomaly_detector = IsolationForest(contamination=0.1, random_state=42)
self.scaler = StandardScaler()
self.feature_importance = None
def prepare_features(self, sensor_data):
"""
Extract features from sensor data for predictive modeling
"""
features = {}
# Statistical features
features['temp_mean'] = sensor_data['temperature'].mean()
features['temp_std'] = sensor_data['temperature'].std()
features['temp_max'] = sensor_data['temperature'].max()
features['temp_min'] = sensor_data['temperature'].min()
features['vibration_mean'] = sensor_data['vibration'].mean()
features['vibration_std'] = sensor_data['vibration'].std()
features['vibration_rms'] = np.sqrt(np.mean(sensor_data['vibration']**2))
features['pressure_mean'] = sensor_data['pressure'].mean()
features['pressure_std'] = sensor_data['pressure'].std()
# Operational features
features['operating_hours'] = sensor_data['cumulative_operating_hours'].iloc[-1]
features['cycles_since_maintenance'] = sensor_data['cumulative_load_cycles'].iloc[-1]
# Time-based features
features['time_since_last_failure'] = sensor_data['time_since_last_failure'].iloc[-1]
return pd.Series(features)
def train_failure_prediction(self, training_data, labels):
"""
Train the failure prediction model
"""
# Prepare features
X = []
for equipment_id in training_data['equipment_id'].unique():
equipment_data = training_data[training_data['equipment_id'] == equipment_id]
features = self.prepare_features(equipment_data)
X.append(features)
X = pd.DataFrame(X)
# Scale features
X_scaled = self.scaler.fit_transform(X)
# Train model
self.failure_model.fit(X_scaled, labels)
# Store feature importance
self.feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': self.failure_model.feature_importances_
}).sort_values('importance', ascending=False)
return self.failure_model
def predict_failure_probability(self, sensor_data):
"""
Predict failure probability for given equipment
"""
features = self.prepare_features(sensor_data)
features_scaled = self.scaler.transform([features])
failure_probability = self.failure_model.predict_proba(features_scaled)[:, 1]
return failure_probability
def detect_anomalies(self, sensor_data):
"""
Detect anomalies in sensor readings
"""
# Prepare time-series features
features = sensor_data[['temperature', 'vibration', 'pressure']].values
# Detect anomalies
anomaly_scores = self.anomaly_detector.decision_function(features)
anomalies = self.anomaly_detector.predict(features)
return anomaly_scores, anomalies
def maintenance_recommendations(self, equipment_data):
"""
Generate maintenance recommendations based on predictions
"""
recommendations = []
for equipment_id in equipment_data['equipment_id'].unique():
data = equipment_data[equipment_data['equipment_id'] == equipment_id]
# Get failure probability
failure_prob = self.predict_failure_probability(data)
# Get anomaly detection results
_, anomalies = self.detect_anomalies(data)
anomaly_count = sum(anomalies == -1)
# Generate recommendation
if failure_prob > 0.8:
priority = "Critical"
action = "Schedule immediate maintenance"
elif failure_prob > 0.6:
priority = "High"
action = "Schedule maintenance within 48 hours"
elif failure_prob > 0.4 or anomaly_count > 10:
priority = "Medium"
action = "Schedule maintenance within 1 week"
else:
priority = "Low"
action = "Continue monitoring"
recommendations.append({
'equipment_id': equipment_id,
'failure_probability': failure_prob,
'anomaly_count': anomaly_count,
'priority': priority,
'recommended_action': action
})
return pd.DataFrame(recommendations)
def visualize_equipment_health(self, sensor_data, equipment_id):
"""
Create visualizations for equipment health monitoring
"""
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Temperature over time
axes[0, 0].plot(sensor_data.index, sensor_data['temperature'])
axes[0, 0].set_title(f'Temperature - Equipment {equipment_id}')
axes[0, 0].set_ylabel('Temperature (°C)')
# Vibration over time
axes[0, 1].plot(sensor_data.index, sensor_data['vibration'], color='orange')
axes[0, 1].set_title(f'Vibration - Equipment {equipment_id}')
axes[0, 1].set_ylabel('Vibration (mm/s)')
# Pressure over time
axes[1, 0].plot(sensor_data.index, sensor_data['pressure'], color='green')
axes[1, 0].set_title(f'Pressure - Equipment {equipment_id}')
axes[1, 0].set_ylabel('Pressure (bar)')
# Feature importance
if self.feature_importance is not None:
top_features = self.feature_importance.head(8)
axes[1, 1].barh(top_features['feature'], top_features['importance'])
axes[1, 1].set_title('Feature Importance')
axes[1, 1].set_xlabel('Importance')
plt.tight_layout()
return fig
# Example usage
if __name__ == "__main__":
# Initialize the predictive maintenance system
pm_system = PredictiveMaintenanceSystem()
# Generate sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=1000, freq='H')
sample_data = pd.DataFrame({
'timestamp': dates,
'equipment_id': np.random.choice(['EQ001', 'EQ002', 'EQ003'], 1000),
'temperature': np.random.normal(75, 10, 1000),
'vibration': np.random.normal(2.5, 0.5, 1000),
'pressure': np.random.normal(15, 2, 1000),
'cumulative_operating_hours': np.cumsum(np.ones(1000)),
'cumulative_load_cycles': np.random.randint(0, 500, 1000),
'time_since_last_failure': np.random.randint(0, 1000, 1000)
})
# Generate failure labels (for training)
failure_labels = np.random.choice([0, 1], 100, p=[0.8, 0.2])
print("Predictive Maintenance System Demo")
print("=================================")
print(f"Sample data shape: {sample_data.shape}")
print(f"Equipment IDs: {sample_data['equipment_id'].unique()}")ML-powered quality control systems can detect defects in real-time during manufacturing processes.
import cv2
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt
class ManufacturingQualityControl:
"""
Computer vision-based quality control system for manufacturing
"""
def __init__(self, input_shape=(224, 224, 3)):
self.input_shape = input_shape
self.defect_model = self._build_defect_detection_model()
self.surface_inspector = self._build_surface_inspection_model()
def _build_defect_detection_model(self):
"""
Build CNN model for defect detection
"""
model = models.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=self.input_shape),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D((2, 2)),
layers.Conv2D(128, (3, 3), activation='relu'),
layers.GlobalAveragePooling2D(),
layers.Dense(128, activation='relu'),
layers.Dropout(0.5),
layers.Dense(4, activation='softmax') # 4 classes: good, scratch, dent, crack
])
model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def _build_surface_inspection_model(self):
"""
Build model for surface quality inspection
"""
# Using transfer learning with pre-trained model
base_model = tf.keras.applications.VGG16(
weights='imagenet',
include_top=False,
input_shape=self.input_shape
)
base_model.trainable = False
model = models.Sequential([
base_model,
layers.GlobalAveragePooling2D(),
layers.Dense(256, activation='relu'),
layers.Dropout(0.5),
layers.Dense(2, activation='softmax') # good/defective
])
model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
return model
def preprocess_image(self, image_path):
"""
Preprocess image for model input
"""
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (224, 224))
image = image.astype('float32') / 255.0
return np.expand_dims(image, axis=0)
def detect_defects(self, image_path):
"""
Detect defects in manufacturing part
"""
image = self.preprocess_image(image_path)
# Predict defect type
prediction = self.defect_model.predict(image)
defect_classes = ['good', 'scratch', 'dent', 'crack']
predicted_class = defect_classes[np.argmax(prediction)]
confidence = np.max(prediction)
# Surface quality inspection
surface_prediction = self.surface_inspector.predict(image)
surface_quality = 'good' if np.argmax(surface_prediction) == 0 else 'defective'
surface_confidence = np.max(surface_prediction)
return {
'defect_type': predicted_class,
'defect_confidence': confidence,
'surface_quality': surface_quality,
'surface_confidence': surface_confidence
}
def batch_inspection(self, image_paths):
"""
Perform batch inspection of multiple parts
"""
results = []
for image_path in image_paths:
result = self.detect_defects(image_path)
result['image_path'] = image_path
results.append(result)
return pd.DataFrame(results)
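A short usage sketch for the quality-control class (the image paths are placeholders, and the models are untrained here; a real deployment would first fit them on labeled defect images):

```python
qc = ManufacturingQualityControl()
# Hypothetical image paths captured by a production-line camera
report = qc.batch_inspection(['part_001.jpg', 'part_002.jpg'])
print(report[['image_path', 'defect_type', 'defect_confidence']])
```

ML algorithms optimize supply chain operations through demand forecasting and inventory management.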
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
class SupplyChainOptimizer:
"""
ML-powered supply chain optimization system
"""
def __init__(self):
self.demand_forecaster = RandomForestRegressor(n_estimators=100, random_state=42)
self.inventory_optimizer = None
self.seasonal_model = None
self.scaler = StandardScaler()
def prepare_demand_features(self, data):
"""
Prepare features for demand forecasting
"""
features = data.copy()
# Time-based features
features['year'] = features['date'].dt.year
features['month'] = features['date'].dt.month
features['day_of_week'] = features['date'].dt.dayofweek
features['quarter'] = features['date'].dt.quarter
# Lag features
for lag in [1, 7, 30]:
features[f'demand_lag_{lag}'] = features['demand'].shift(lag)
# Rolling statistics
features['demand_rolling_7'] = features['demand'].rolling(7).mean()
features['demand_rolling_30'] = features['demand'].rolling(30).mean()
features['demand_std_7'] = features['demand'].rolling(7).std()
# Economic indicators (if available)
if 'economic_indicator' in features.columns:
features['economic_lag_1'] = features['economic_indicator'].shift(1)
return features.dropna()
def train_demand_forecaster(self, historical_data):
"""
Train demand forecasting model
"""
# Prepare features
features = self.prepare_demand_features(historical_data)
# Select feature columns
feature_cols = [col for col in features.columns
if col not in ['date', 'demand', 'product_id']]
X = features[feature_cols]
y = features['demand']
# Time series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
cv_scores = []
for train_idx, val_idx in tscv.split(X):
X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
y_train_cv, y_val_cv = y.iloc[train_idx], y.iloc[val_idx]
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_cv)
X_val_scaled = scaler.transform(X_val_cv)
# Train and evaluate
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train_cv)
val_pred = model.predict(X_val_scaled)
mae = mean_absolute_error(y_val_cv, val_pred)
cv_scores.append(mae)
# Train final model on all data
X_scaled = self.scaler.fit_transform(X)
self.demand_forecaster.fit(X_scaled, y)
return {
'cv_mae_mean': np.mean(cv_scores),
'cv_mae_std': np.std(cv_scores)
}
def optimize_inventory(self, demand_forecast, lead_time=7, service_level=0.95):
"""Optimize inventory levels based on demand forecast"""
import numpy as np
from scipy.optimize import minimize
# Calculate statistics
avg_demand = demand_forecast['predicted_demand'].mean()
demand_std = demand_forecast['predicted_demand'].std()
# Calculate safety stock
from scipy.stats import norm
z_score = norm.ppf(service_level)
safety_stock = z_score * demand_std * np.sqrt(lead_time)
# Calculate reorder point
reorder_point = (avg_demand * lead_time) + safety_stock
# Calculate economic order quantity (EOQ)
annual_demand = avg_demand * 365 # Annualized daily demand
ordering_cost = 50 # Assumed ordering cost
holding_cost_rate = 0.2 # 20% of item cost
item_cost = 10 # Assumed item cost
eoq = np.sqrt((2 * annual_demand * ordering_cost) /
(holding_cost_rate * item_cost))
return {
'reorder_point': reorder_point,
'safety_stock': safety_stock,
'economic_order_quantity': eoq
}
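# Worked example with illustrative numbers: average daily demand 30 units,
# demand std 8, lead time 7 days, 95% service level (z ~ 1.645):
#   safety_stock  = 1.645 * 8 * sqrt(7) ~ 35 units
#   reorder_point = 30 * 7 + 35 = 245 units
# With D = 30 * 365 = 10,950 units/year, S = 50 and H = 0.2 * 10 = 2:
#   EOQ = sqrt(2 * D * S / H) = sqrt(547,500) ~ 740 units per order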
def seasonal_decomposition_forecasting(self, time_series_data, model='additive'):
"""
Seasonal decomposition and forecasting using ARIMA
"""
# Decompose the time series
decomposition = seasonal_decompose(time_series_data, model=model)
decomposition.plot()
plt.show()
# Fit ARIMA model
model = ARIMA(time_series_data, order=(1, 1, 1))
model_fit = model.fit()
# Forecast future values
forecast = model_fit.forecast(steps=30)
return forecast

Machine learning enables intelligent energy distribution, demand prediction, and grid optimization for improved efficiency and reliability.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')
class SmartGridOptimizer:
"""
Comprehensive smart grid optimization and energy management system
"""
def __init__(self):
self.load_forecaster = GradientBoostingRegressor(n_estimators=100, random_state=42)
self.renewable_predictor = RandomForestRegressor(n_estimators=100, random_state=42)
self.fault_detector = RandomForestRegressor(n_estimators=100, random_state=42)
self.scaler = StandardScaler()
self.feature_importance = {}
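# NOTE: a minimal, illustrative load-forecasting method; the column names
# ('timestamp', 'load_mw', 'temperature') are assumed for this sketch.
def forecast_load(self, load_data):
    """Train the short-term electricity load forecaster"""
    features = load_data.copy()
    # Calendar features capture daily and weekly demand cycles
    features['hour'] = features['timestamp'].dt.hour
    features['day_of_week'] = features['timestamp'].dt.dayofweek
    features['month'] = features['timestamp'].dt.month
    # Lagged load: demand one hour and one day earlier
    features['load_lag_1'] = features['load_mw'].shift(1)
    features['load_lag_24'] = features['load_mw'].shift(24)
    features = features.dropna()
    feature_cols = ['hour', 'day_of_week', 'month', 'temperature',
                    'load_lag_1', 'load_lag_24']
    X = self.scaler.fit_transform(features[feature_cols])
    y = features['load_mw']
    self.load_forecaster.fit(X, y)
    return self.load_forecaster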
\newpage
## Agriculture and Environmental Sciences
### Precision Agriculture and Crop Optimization
Machine learning enables data-driven farming decisions through precision agriculture, crop monitoring, and yield optimization.
#### Key Applications:
- **Crop Monitoring**: Satellite and drone imagery analysis for crop health assessment
- **Yield Prediction**: Forecasting crop yields based on environmental and historical data
- **Pest and Disease Detection**: Early identification of agricultural threats
- **Resource Optimization**: Efficient use of water, fertilizers, and pesticides
#### Implementation Example:
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, accuracy_score, classification_report
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
class PrecisionAgricultureSystem:
"""
Comprehensive precision agriculture and crop optimization system
"""
def __init__(self):
self.yield_predictor = RandomForestRegressor(n_estimators=100, random_state=42)
self.disease_detector = GradientBoostingClassifier(n_estimators=100, random_state=42)
self.irrigation_optimizer = RandomForestRegressor(n_estimators=100, random_state=42)
self.scaler = StandardScaler()
self.crop_models = {}
def prepare_agricultural_features(self, farm_data, weather_data, soil_data):
"""
Prepare comprehensive features for agricultural modeling
"""
# Merge all data sources
features = farm_data.merge(weather_data, on=['date', 'field_id'], how='left')
features = features.merge(soil_data, on='field_id', how='left')
# Weather-derived features
features['growing_degree_days'] = np.maximum(
(features['temperature_max'] + features['temperature_min']) / 2 - features['base_temperature'], 0
)
features['heat_stress_days'] = (features['temperature_max'] > 32).astype(int)
features['frost_risk'] = (features['temperature_min'] < 0).astype(int)
# Precipitation features
features['cumulative_rainfall'] = features.groupby('field_id')['precipitation'].cumsum()
features['days_since_rain'] = features.groupby('field_id')['precipitation'].apply(
lambda x: (x == 0).cumsum() - (x == 0).cumsum().where(x != 0).ffill().fillna(0)
)
# Soil moisture estimation
features['estimated_soil_moisture'] = (
features['soil_moisture_capacity'] *
np.exp(-features['days_since_rain'] * 0.1) +
features['precipitation'] * 0.8
)
# Seasonal features
features['day_of_year'] = features['date'].dt.dayofyear
features['growing_season'] = ((features['day_of_year'] >= 100) &
(features['day_of_year'] <= 280)).astype(int)
# Crop development stage estimation
features['estimated_growth_stage'] = self._estimate_growth_stage(
features['planting_date'], features['date'], features['crop_type']
)
# Historical yield averages
if 'historical_yield' in features.columns:
features['yield_deviation_historical'] = (
features.groupby('field_id')['historical_yield'].transform('mean')
)
return features
def _estimate_growth_stage(self, planting_date, current_date, crop_type):
"""
Estimate crop growth stage based on planting date and crop type
"""
days_since_planting = (current_date - planting_date).dt.days
# Growth stage thresholds (days) for different crops
growth_stages = {
'corn': {'germination': 10, 'vegetative': 45, 'reproductive': 90, 'maturity': 120},
'wheat': {'germination': 7, 'vegetative': 30, 'reproductive': 80, 'maturity': 110},
'soybeans': {'germination': 8, 'vegetative': 35, 'reproductive': 75, 'maturity': 115},
'rice': {'germination': 12, 'vegetative': 50, 'reproductive': 100, 'maturity': 130}
}
# Default to corn if crop type not found
stages = growth_stages.get(crop_type.iloc[0] if hasattr(crop_type, 'iloc') else 'corn',
growth_stages['corn'])
stage_values = []
for days in days_since_planting:
if days < stages['germination']:
stage_values.append(1) # Germination
elif days < stages['vegetative']:
stage_values.append(2) # Vegetative
elif days < stages['reproductive']:
stage_values.append(3) # Reproductive
elif days < stages['maturity']:
stage_values.append(4) # Maturity
else:
stage_values.append(5) # Post-harvest
return pd.Series(stage_values)
def train_yield_prediction_model(self, training_data, weather_data, soil_data):
"""
Train crop yield prediction model
"""
# Prepare features
features = self.prepare_agricultural_features(training_data, weather_data, soil_data)
features_clean = features.dropna()
# Define feature columns
feature_columns = [
'growing_degree_days', 'heat_stress_days', 'frost_risk',
'cumulative_rainfall', 'days_since_rain', 'estimated_soil_moisture',
'day_of_year', 'growing_season', 'estimated_growth_stage',
'soil_ph', 'soil_organic_matter', 'soil_nitrogen', 'soil_phosphorus',
'fertilizer_nitrogen', 'fertilizer_phosphorus', 'irrigation_amount'
]
# Train separate models for different crops
crop_performance = {}
for crop_type in features_clean['crop_type'].unique():
crop_data = features_clean[features_clean['crop_type'] == crop_type]
if len(crop_data) < 20: # Need sufficient data
continue
X = crop_data[feature_columns]
y = crop_data['actual_yield']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluate
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
# Store model and performance
self.crop_models[crop_type] = {
'model': model,
'scaler': scaler,
'mae': mae,
'feature_importance': pd.DataFrame({
'feature': feature_columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
}
crop_performance[crop_type] = mae
return crop_performance
def predict_crop_yield(self, field_data, weather_forecast, soil_data):
"""
Predict crop yields for specified fields
"""
# Prepare features
features = self.prepare_agricultural_features(field_data, weather_forecast, soil_data)
predictions = []
for crop_type in features['crop_type'].unique():
if crop_type not in self.crop_models:
continue
crop_data = features[features['crop_type'] == crop_type]
model_info = self.crop_models[crop_type]
# Prepare prediction features
feature_columns = [
'growing_degree_days', 'heat_stress_days', 'frost_risk',
'cumulative_rainfall', 'days_since_rain', 'estimated_soil_moisture',
'day_of_year', 'growing_season', 'estimated_growth_stage',
'soil_ph', 'soil_organic_matter', 'soil_nitrogen', 'soil_phosphorus',
'fertilizer_nitrogen', 'fertilizer_phosphorus', 'irrigation_amount'
]
X = crop_data[feature_columns].fillna(0)
X_scaled = model_info['scaler'].transform(X)
# Make predictions
yield_predictions = model_info['model'].predict(X_scaled)
# Add to results
for i, prediction in enumerate(yield_predictions):
predictions.append({
'field_id': crop_data.iloc[i]['field_id'],
'crop_type': crop_type,
'predicted_yield': prediction,
'confidence_interval_lower': prediction * 0.85, # Simplified confidence interval
'confidence_interval_upper': prediction * 1.15,
'model_mae': model_info['mae']
})
return pd.DataFrame(predictions)
def optimize_resource_allocation(self, field_data, resource_constraints):
"""
Optimize allocation of water, fertilizer, and other resources
"""
optimization_results = []
# Available resources
total_water = resource_constraints.get('total_water_budget', 1000000) # liters
total_nitrogen = resource_constraints.get('total_nitrogen_budget', 5000) # kg
total_phosphorus = resource_constraints.get('total_phosphorus_budget', 2000) # kg
# Calculate resource needs per field
for _, field in field_data.iterrows():
field_id = field['field_id']
field_area = field['field_area_hectares']
crop_type = field['crop_type']
current_soil_moisture = field.get('current_soil_moisture', 0.3)
# Water optimization
optimal_soil_moisture = 0.6 # Target soil moisture
water_deficit = max(0, optimal_soil_moisture - current_soil_moisture)
water_needed = water_deficit * field_area * 10000 * 0.3 # simplified volume estimate (ha x 10,000 m2 x 0.3 m effective depth)
# Fertilizer optimization based on soil tests
nitrogen_deficiency = max(0, 40 - field.get('soil_nitrogen', 20)) # mg/kg
phosphorus_deficiency = max(0, 20 - field.get('soil_phosphorus', 10)) # mg/kg
nitrogen_needed = nitrogen_deficiency * field_area * 2.24 # kg per hectare
phosphorus_needed = phosphorus_deficiency * field_area * 1.12 # kg per hectare
# Priority scoring based on yield potential and deficiency
yield_potential = field.get('expected_yield', 5000) # kg per hectare
priority_score = (
(water_deficit * 0.4) +
(nitrogen_deficiency / 40 * 0.3) +
(phosphorus_deficiency / 20 * 0.2) +
(yield_potential / 10000 * 0.1)
)
optimization_results.append({
'field_id': field_id,
'crop_type': crop_type,
'field_area': field_area,
'water_needed_liters': water_needed,
'nitrogen_needed_kg': nitrogen_needed,
'phosphorus_needed_kg': phosphorus_needed,
'priority_score': priority_score,
'expected_yield_improvement': self._estimate_yield_improvement(
water_deficit, nitrogen_deficiency, phosphorus_deficiency
)
})
# Sort by priority and allocate resources
results_df = pd.DataFrame(optimization_results).sort_values('priority_score', ascending=False)
# Resource allocation with constraints
allocated_water = 0
allocated_nitrogen = 0
allocated_phosphorus = 0
for idx, row in results_df.iterrows():
# Allocate water
water_allocation = min(row['water_needed_liters'], total_water - allocated_water)
nitrogen_allocation = min(row['nitrogen_needed_kg'], total_nitrogen - allocated_nitrogen)
phosphorus_allocation = min(row['phosphorus_needed_kg'], total_phosphorus - allocated_phosphorus)
results_df.loc[idx, 'allocated_water_liters'] = water_allocation
results_df.loc[idx, 'allocated_nitrogen_kg'] = nitrogen_allocation
results_df.loc[idx, 'allocated_phosphorus_kg'] = phosphorus_allocation
# Calculate satisfaction ratios
results_df.loc[idx, 'water_satisfaction'] = (
water_allocation / row['water_needed_liters'] if row['water_needed_liters'] > 0 else 1.0
)
results_df.loc[idx, 'nutrient_satisfaction'] = (
(nitrogen_allocation / row['nitrogen_needed_kg'] +
phosphorus_allocation / row['phosphorus_needed_kg']) / 2
if row['nitrogen_needed_kg'] > 0 or row['phosphorus_needed_kg'] > 0 else 1.0
)
allocated_water += water_allocation
allocated_nitrogen += nitrogen_allocation
allocated_phosphorus += phosphorus_allocation
# Stop if resources exhausted
if (allocated_water >= total_water * 0.95 and
allocated_nitrogen >= total_nitrogen * 0.95 and
allocated_phosphorus >= total_phosphorus * 0.95):
break
return results_df
def _estimate_yield_improvement(self, water_deficit, nitrogen_deficiency, phosphorus_deficiency):
"""
Estimate yield improvement from addressing deficiencies
"""
# Simplified yield response curves
water_improvement = (1 - np.exp(-water_deficit * 5)) * 0.3 # Up to 30% improvement
nitrogen_improvement = (nitrogen_deficiency / 40) * 0.25 # Up to 25% improvement
phosphorus_improvement = (phosphorus_deficiency / 20) * 0.15 # Up to 15% improvement
# Combined effect (not additive due to limiting factors)
total_improvement = 1 - ((1 - water_improvement) *
(1 - nitrogen_improvement) *
(1 - phosphorus_improvement))
return min(total_improvement, 0.5) # Cap at 50% improvement
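# Worked example of the estimate above (illustrative numbers):
# water_deficit = 0.3, nitrogen_deficiency = 20 mg/kg, phosphorus_deficiency = 10 mg/kg
#   water:      (1 - exp(-0.3 * 5)) * 0.30 ~ 0.233
#   nitrogen:   (20 / 40) * 0.25 = 0.125
#   phosphorus: (10 / 20) * 0.15 = 0.075
#   combined:   1 - (1 - 0.233)(1 - 0.125)(1 - 0.075) ~ 0.38, i.e. ~38% gain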
class CropHealthMonitoring:
"""
System for monitoring crop health and detecting diseases/pests
"""
def __init__(self):
self.disease_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)
self.pest_detector = RandomForestRegressor(n_estimators=100, random_state=42)
self.health_scorer = RandomForestRegressor(n_estimators=100, random_state=42)
def analyze_crop_images(self, image_features):
"""
Analyze crop images for health assessment
"""
# Simulate image analysis features
# In practice, this would use computer vision to extract:
# - Color histograms
# - Texture features
# - Shape characteristics
# - Disease symptoms
health_indicators = []
for _, image in image_features.iterrows():
# Extract health indicators from image features
green_ratio = image.get('green_pixels_ratio', 0.6)
yellow_ratio = image.get('yellow_pixels_ratio', 0.1)
brown_ratio = image.get('brown_pixels_ratio', 0.05)
texture_variation = image.get('texture_variation', 0.3)
# Calculate health score
health_score = (
green_ratio * 0.5 +
(1 - yellow_ratio) * 0.3 +
(1 - brown_ratio) * 0.2
)
# Detect potential issues
issues = []
if yellow_ratio > 0.3:
issues.append('Nutrient deficiency (likely nitrogen)')
if brown_ratio > 0.2:
issues.append('Disease or drought stress')
if texture_variation > 0.6:
issues.append('Possible pest damage')
health_indicators.append({
'field_id': image['field_id'],
'image_date': image['capture_date'],
'health_score': health_score,
'green_ratio': green_ratio,
'stress_indicators': yellow_ratio + brown_ratio,
'potential_issues': issues,
'recommended_action': self._recommend_action(health_score, issues)
})
return pd.DataFrame(health_indicators)
def _recommend_action(self, health_score, issues):
"""
Recommend actions based on health assessment
"""
if health_score < 0.3:
return "Immediate intervention required - investigate and treat"
elif health_score < 0.6:
return "Monitor closely and consider treatment"
elif len(issues) > 0:
return f"Address specific issues: {', '.join(issues)}"
else:
return "Continue normal monitoring"Machine learning enables governments to optimize public services, improve urban planning, and enhance citizen engagement.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
class SmartCityPlatform:
"""
Comprehensive smart city management and optimization platform
"""
def __init__(self):
self.traffic_optimizer = GradientBoostingRegressor(n_estimators=100, random_state=42)
self.crime_predictor = RandomForestClassifier(n_estimators=100, random_state=42)
self.service_demand_forecaster = RandomForestRegressor(n_estimators=100, random_state=42)
self.resource_allocator = KMeans(n_clusters=10, random_state=42)
self.scaler = StandardScaler()
def optimize_traffic_flow(self, traffic_data, infrastructure_data):
"""
Optimize traffic flow using ML-based signal control
"""
# Prepare traffic features
features = traffic_data.copy()
# Time-based features
features['hour'] = features['timestamp'].dt.hour
features['day_of_week'] = features['timestamp'].dt.dayofweek
features['is_rush_hour'] = ((features['hour'].isin([7, 8, 9, 17, 18, 19]))).astype(int)
features['is_weekend'] = (features['day_of_week'] >= 5).astype(int)
# Traffic volume features
features['volume_ratio'] = features['current_volume'] / features['road_capacity']
features['congestion_level'] = pd.cut(
features['volume_ratio'],
bins=[0, 0.3, 0.6, 0.8, 1.0, np.inf],
labels=['Free', 'Light', 'Moderate', 'Heavy', 'Gridlock']
)
# Historical patterns
features['avg_volume_same_hour'] = features.groupby(['intersection_id', 'hour'])['current_volume'].transform('mean')
features['volume_deviation'] = features['current_volume'] - features['avg_volume_same_hour']
# Weather impact
if 'weather_condition' in features.columns:
weather_impact = {
'clear': 1.0, 'rain': 0.8, 'snow': 0.6, 'fog': 0.7
}
features['weather_factor'] = features['weather_condition'].map(weather_impact).fillna(1.0)
features['adjusted_capacity'] = features['road_capacity'] * features['weather_factor']
# Optimize signal timing
optimization_results = []
for intersection_id in features['intersection_id'].unique():
intersection_data = features[features['intersection_id'] == intersection_id]
# Calculate optimal signal timing
total_volume = intersection_data['current_volume'].sum()
north_south_volume = intersection_data[
intersection_data['direction'].isin(['north', 'south'])
]['current_volume'].sum()
east_west_volume = intersection_data[
intersection_data['direction'].isin(['east', 'west'])
]['current_volume'].sum()
# Proportional timing based on volume
if total_volume > 0:
ns_ratio = north_south_volume / total_volume
ew_ratio = east_west_volume / total_volume
else:
ns_ratio = ew_ratio = 0.5
# Base cycle time (seconds)
base_cycle = 120
# Adjust for congestion
congestion_multiplier = 1 + (intersection_data['volume_ratio'].mean() - 0.5) * 0.3
adjusted_cycle = min(base_cycle * congestion_multiplier, 180) # Cap at 3 minutes
# Calculate green times
ns_green_time = adjusted_cycle * ns_ratio * 0.8 # 80% of cycle for green phases
ew_green_time = adjusted_cycle * ew_ratio * 0.8
optimization_results.append({
'intersection_id': intersection_id,
'recommended_cycle_time': adjusted_cycle,
'ns_green_time': ns_green_time,
'ew_green_time': ew_green_time,
'current_congestion': intersection_data['volume_ratio'].mean(),
'expected_improvement': self._estimate_traffic_improvement(
intersection_data['volume_ratio'].mean()
)
})
return pd.DataFrame(optimization_results)
def predict_crime_hotspots(self, crime_data, demographic_data, infrastructure_data):
"""
Predict crime hotspots for proactive policing
"""
# Prepare features for crime prediction
features = crime_data.merge(demographic_data, on='area_id', how='left')
features = features.merge(infrastructure_data, on='area_id', how='left')
# Time-based features
features['hour'] = features['incident_time'].dt.hour
features['day_of_week'] = features['incident_time'].dt.dayofweek
features['month'] = features['incident_time'].dt.month
features['is_night'] = ((features['hour'] >= 22) | (features['hour'] <= 5)).astype(int)
features['is_weekend'] = (features['day_of_week'] >= 5).astype(int)
# Demographic risk factors
features['unemployment_risk'] = features['unemployment_rate'] / 100
features['poverty_risk'] = features['poverty_rate'] / 100
features['education_risk'] = 1 - (features['high_school_graduation_rate'] / 100)
# Infrastructure factors
features['lighting_adequacy'] = features['street_lights_per_km'] / 10 # Normalized
features['police_proximity'] = 1 / (features['distance_to_police_station'] + 1)
features['commercial_density'] = features['commercial_establishments'] / features['area_sq_km']
# Historical crime patterns
crime_history = features.groupby(['area_id', 'crime_type']).size().reset_index(name='historical_count')
features = features.merge(crime_history, on=['area_id', 'crime_type'], how='left')
features['historical_count'] = features['historical_count'].fillna(0)
# Train crime prediction model
feature_columns = [
'hour', 'day_of_week', 'month', 'is_night', 'is_weekend',
'unemployment_risk', 'poverty_risk', 'education_risk',
'lighting_adequacy', 'police_proximity', 'commercial_density',
'historical_count', 'population_density'
]
X = features[feature_columns].fillna(0)
y = features['high_risk_area'] # Binary target: high risk vs low risk
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
self.crime_predictor.fit(X_train, y_train)
# Predict crime risk
y_pred = self.crime_predictor.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Generate risk predictions for all areas
risk_predictions = features.groupby('area_id').apply(
lambda x: self._calculate_area_risk_score(x, feature_columns)
).reset_index(name='risk_score')
# Rank areas by risk
risk_predictions['risk_category'] = pd.qcut(
risk_predictions['risk_score'],
q=5,
labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)
return risk_predictions, accuracy
    def optimize_public_service_delivery(self, service_requests, resource_data, population_data):
        """
        Optimize allocation of public services and resources
        """
        # Analyze service demand patterns
        demand_analysis = service_requests.groupby(['area_id', 'service_type']).agg({
            'request_id': 'count',
            'resolution_time_hours': 'mean',
            'citizen_satisfaction': 'mean'
        }).rename(columns={'request_id': 'request_count'})

        # Merge with population and resource data
        optimization_data = demand_analysis.reset_index()
        optimization_data = optimization_data.merge(population_data, on='area_id', how='left')
        optimization_data = optimization_data.merge(resource_data, on=['area_id', 'service_type'], how='left')

        # Calculate service efficiency metrics
        optimization_data['requests_per_capita'] = (
            optimization_data['request_count'] / optimization_data['population']
        )
        optimization_data['resource_utilization'] = (
            optimization_data['request_count'] / optimization_data['available_staff']
        )
        optimization_data['efficiency_score'] = (
            optimization_data['citizen_satisfaction'] / optimization_data['resolution_time_hours']
        )

        # Identify optimization opportunities
        optimization_recommendations = []
        for service_type in optimization_data['service_type'].unique():
            service_data = optimization_data[optimization_data['service_type'] == service_type]

            # Identify underserved areas (high demand, low satisfaction)
            underserved = service_data[
                (service_data['requests_per_capita'] > service_data['requests_per_capita'].median()) &
                (service_data['citizen_satisfaction'] < service_data['citizen_satisfaction'].median())
            ]

            # Identify efficient areas (high satisfaction, low response time)
            efficient = service_data[
                (service_data['citizen_satisfaction'] > service_data['citizen_satisfaction'].quantile(0.75)) &
                (service_data['resolution_time_hours'] < service_data['resolution_time_hours'].median())
            ]

            # Generate recommendations for each underserved area
            for _, area in underserved.iterrows():
                recommendations = []
                if area['resource_utilization'] > 2.0:  # Overutilized
                    recommendations.append("Increase staffing levels")
                if area['resolution_time_hours'] > service_data['resolution_time_hours'].quantile(0.75):
                    recommendations.append("Improve process efficiency")
                if area['available_staff'] < service_data['available_staff'].median():
                    recommendations.append("Reallocate resources from efficient areas")

                optimization_recommendations.append({
                    'area_id': area['area_id'],
                    'service_type': service_type,
                    'current_satisfaction': area['citizen_satisfaction'],
                    'current_response_time': area['resolution_time_hours'],
                    'priority_level': 'High' if len(recommendations) > 1 else 'Medium' if len(recommendations) > 0 else 'Low',
                    'recommendations': recommendations,
                    'estimated_improvement': self._estimate_service_improvement(area)
                })

        return pd.DataFrame(optimization_recommendations)
    def _calculate_area_risk_score(self, area_data, feature_columns):
        """
        Calculate crime risk score for an area
        """
        if len(area_data) == 0:
            return 0.5  # Default medium risk
        X = area_data[feature_columns].fillna(0).mean().values.reshape(1, -1)
        risk_probability = self.crime_predictor.predict_proba(X)[0][1]  # Probability of high risk
        return risk_probability

    def _estimate_traffic_improvement(self, current_congestion):
        """
        Estimate traffic flow improvement from signal optimization
        """
        if current_congestion > 0.8:
            return 0.25  # 25% improvement for heavily congested areas
        elif current_congestion > 0.6:
            return 0.15  # 15% improvement for moderately congested areas
        else:
            return 0.05  # 5% improvement for lightly congested areas

    def _estimate_service_improvement(self, area_data):
        """
        Estimate service delivery improvement potential
        """
        current_efficiency = area_data['efficiency_score']
        if current_efficiency < 0.3:
            return "30-50% improvement possible"
        elif current_efficiency < 0.6:
            return "15-30% improvement possible"
        else:
            return "5-15% improvement possible"
class CitizenEngagementPlatform:
    """
    Platform for enhancing citizen engagement and service delivery
    """
    def __init__(self):
        self.sentiment_analyzer = RandomForestClassifier(n_estimators=100, random_state=42)
        self.issue_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        self.response_prioritizer = GradientBoostingRegressor(n_estimators=100, random_state=42)

    def analyze_citizen_feedback(self, feedback_data):
        """
        Analyze citizen feedback and complaints for insights
        """
        feedback_analysis = []
        for _, feedback in feedback_data.iterrows():
            feedback_text = feedback['feedback_text']

            # Sentiment analysis (simplified keyword counting)
            positive_words = ['good', 'great', 'excellent', 'satisfied', 'helpful', 'efficient']
            negative_words = ['bad', 'terrible', 'awful', 'disappointed', 'frustrated', 'slow']
            positive_count = sum(word in feedback_text.lower() for word in positive_words)
            negative_count = sum(word in feedback_text.lower() for word in negative_words)

            if positive_count > negative_count:
                sentiment = 'positive'
                sentiment_score = 0.7 + (positive_count - negative_count) * 0.1
            elif negative_count > positive_count:
                sentiment = 'negative'
                sentiment_score = 0.3 - (negative_count - positive_count) * 0.1
            else:
                sentiment = 'neutral'
                sentiment_score = 0.5
            sentiment_score = max(0, min(1, sentiment_score))  # Clamp to [0, 1]

            # Issue categorization (simplified keyword matching)
            issue_keywords = {
                'infrastructure': ['road', 'bridge', 'water', 'sewer', 'streetlight'],
                'public_safety': ['police', 'crime', 'safety', 'emergency'],
                'transportation': ['bus', 'traffic', 'parking', 'transit'],
                'environment': ['waste', 'recycling', 'pollution', 'park'],
                'administration': ['permit', 'license', 'tax', 'office']
            }
            issue_category = 'general'
            max_matches = 0
            for category, keywords in issue_keywords.items():
                matches = sum(keyword in feedback_text.lower() for keyword in keywords)
                if matches > max_matches:
                    max_matches = matches
                    issue_category = category

            # Priority scoring
            urgency_keywords = ['urgent', 'emergency', 'dangerous', 'broken', 'immediate']
            urgency_score = sum(keyword in feedback_text.lower() for keyword in urgency_keywords)
            priority_score = (
                (1 - sentiment_score) * 0.4 +    # Negative sentiment = higher priority
                min(urgency_score / 2, 1) * 0.6  # Urgency indicators
            )

            feedback_analysis.append({
                'feedback_id': feedback['feedback_id'],
                'citizen_id': feedback['citizen_id'],
                'area_id': feedback.get('area_id', 'unknown'),
                'sentiment': sentiment,
                'sentiment_score': sentiment_score,
                'issue_category': issue_category,
                'priority_score': priority_score,
                'requires_followup': priority_score > 0.7 or sentiment == 'negative'
            })

        return pd.DataFrame(feedback_analysis)
    def personalize_citizen_services(self, citizen_data, interaction_history):
        """
        Personalize government services for citizens
        """
        personalization_profiles = []
        for citizen_id in citizen_data['citizen_id'].unique():
            citizen_info = citizen_data[citizen_data['citizen_id'] == citizen_id].iloc[0]
            citizen_interactions = interaction_history[
                interaction_history['citizen_id'] == citizen_id
            ]

            # Analyze interaction patterns
            frequent_services = citizen_interactions['service_type'].value_counts()
            interaction_channels = citizen_interactions['channel'].value_counts()

            # Demographic-based recommendations
            age_group = self._categorize_age(citizen_info['age'])
            income_bracket = self._categorize_income(citizen_info.get('income', 50000))

            # Generate personalized recommendations
            service_recommendations = []
            if age_group == 'senior':
                service_recommendations.extend([
                    'Senior citizen benefits enrollment',
                    'Healthcare service navigation',
                    'Property tax exemptions'
                ])
            elif age_group == 'young_adult':
                service_recommendations.extend([
                    'Voter registration',
                    'Business license information',
                    'First-time homebuyer programs'
                ])
            elif age_group == 'family':
                service_recommendations.extend([
                    'School enrollment assistance',
                    'Family recreation programs',
                    'Child care resources'
                ])

            # Income-based recommendations
            if income_bracket == 'low':
                service_recommendations.extend([
                    'Financial assistance programs',
                    'Housing assistance',
                    'Utility bill assistance'
                ])

            # Channel preferences
            preferred_channel = interaction_channels.index[0] if len(interaction_channels) > 0 else 'online'

            personalization_profiles.append({
                'citizen_id': citizen_id,
                'age_group': age_group,
                'income_bracket': income_bracket,
                'preferred_channel': preferred_channel,
                'frequent_services': frequent_services.head(3).index.tolist(),
                'recommended_services': service_recommendations,
                'engagement_score': len(citizen_interactions) / 12,  # Interactions per month
                'satisfaction_trend': citizen_interactions['satisfaction_rating'].rolling(3).mean().iloc[-1]
                    if len(citizen_interactions) >= 3 else 3.0
            })

        return pd.DataFrame(personalization_profiles)

    def _categorize_age(self, age):
        """Categorize citizen by age group"""
        if age < 25:
            return 'young_adult'
        elif age < 65:
            return 'family'
        else:
            return 'senior'

    def _categorize_income(self, income):
        """Categorize citizen by income bracket"""
        if income < 30000:
            return 'low'
        elif income < 75000:
            return 'middle'
        else:
            return 'high'
# Example usage
if __name__ == "__main__":
    # Initialize smart city platform
    smart_city = SmartCityPlatform()
    engagement_platform = CitizenEngagementPlatform()

    print("Smart City Platform Demo")
    print("========================")

    # Generate sample traffic data
    np.random.seed(42)
    traffic_data = pd.DataFrame({
        'timestamp': pd.date_range('2023-01-01', periods=1000, freq='H'),
        'intersection_id': np.random.choice(['INT_001', 'INT_002', 'INT_003'], 1000),
        'direction': np.random.choice(['north', 'south', 'east', 'west'], 1000),
        'current_volume': np.random.randint(10, 200, 1000),
        'road_capacity': np.random.randint(150, 300, 1000),
        'weather_condition': np.random.choice(['clear', 'rain', 'snow'], 1000, p=[0.7, 0.2, 0.1])
    })
    print(f"Traffic data shape: {traffic_data.shape}")
    print(f"Average traffic volume: {traffic_data['current_volume'].mean():.1f}")
\newpage
## Implementation Best Practices
### Cross-Industry ML Implementation Guidelines
Successfully implementing machine learning across different industries requires following established best practices and avoiding common pitfalls.
#### Key Implementation Principles:
1. **Start with Clear Business Objectives**
   - Define specific, measurable goals
   - Align ML projects with business strategy
   - Establish success metrics upfront
2. **Data Quality and Governance**
   - Implement robust data collection processes
   - Ensure data privacy and security compliance
   - Establish data lineage and documentation
3. **Iterative Development Approach**
   - Begin with proof-of-concept projects
   - Use agile development methodologies
   - Implement continuous integration/deployment
4. **Cross-functional Collaboration**
   - Include domain experts throughout development
   - Foster communication between technical and business teams
   - Establish clear roles and responsibilities
5. **Ethical Considerations and Bias Mitigation**
   - Regular bias testing and fairness assessments (see the sketch after this list)
   - Transparent decision-making processes
   - Inclusive development practices
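To make principle 5 concrete, here is a minimal sketch of one such fairness check. It measures the demographic parity gap, i.e., the spread in positive-prediction rates across a sensitive attribute; the `demographic_parity_gap` helper and the sample arrays are hypothetical, not part of the smart-city code above.

```python
import numpy as np
import pandas as pd

def demographic_parity_gap(y_pred, group):
    """Max difference in positive-prediction rate across groups."""
    rates = pd.DataFrame({'pred': y_pred, 'group': group}).groupby('group')['pred'].mean()
    return rates.max() - rates.min()

# Hypothetical predictions for two demographic groups
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])

gap = demographic_parity_gap(y_pred, group)
print(f"Demographic parity gap: {gap:.2f}")  # 0.50 here; values near 0 indicate parity
```

A gap well above zero does not prove unfairness by itself, but it flags a model for closer review before deployment.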
#### Implementation Framework:
```python
class MLImplementationFramework:
    """
    Comprehensive framework for implementing ML solutions across industries
    """
    def __init__(self):
        self.project_phases = [
            'business_understanding',
            'data_understanding',
            'data_preparation',
            'modeling',
            'evaluation',
            'deployment'
        ]
        self.best_practices = {}
        self.risk_assessments = {}

    def assess_project_readiness(self, project_requirements):
        """
        Assess organization's readiness for ML implementation
        """
        readiness_score = 0
        assessment_results = {}

        # Data readiness (25% weight)
        data_quality = project_requirements.get('data_quality_score', 0.5)
        data_availability = project_requirements.get('data_availability', 0.5)
        data_readiness = (data_quality + data_availability) / 2
        readiness_score += data_readiness * 0.25
        assessment_results['data_readiness'] = data_readiness

        # Technical readiness (25% weight)
        technical_skills = project_requirements.get('team_ml_expertise', 0.5)
        infrastructure = project_requirements.get('technical_infrastructure', 0.5)
        technical_readiness = (technical_skills + infrastructure) / 2
        readiness_score += technical_readiness * 0.25
        assessment_results['technical_readiness'] = technical_readiness

        # Organizational readiness (25% weight)
        leadership_support = project_requirements.get('leadership_commitment', 0.5)
        change_management = project_requirements.get('change_readiness', 0.5)
        organizational_readiness = (leadership_support + change_management) / 2
        readiness_score += organizational_readiness * 0.25
        assessment_results['organizational_readiness'] = organizational_readiness

        # Business case strength (25% weight)
        roi_clarity = project_requirements.get('roi_potential', 0.5)
        problem_definition = project_requirements.get('problem_clarity', 0.5)
        business_readiness = (roi_clarity + problem_definition) / 2
        readiness_score += business_readiness * 0.25
        assessment_results['business_readiness'] = business_readiness

        # Overall assessment
        assessment_results['overall_readiness'] = readiness_score
        assessment_results['recommendation'] = self._get_readiness_recommendation(readiness_score)
        return assessment_results

    def _get_readiness_recommendation(self, score):
        """
        Provide recommendation based on readiness score
        """
        if score >= 0.8:
            return "Ready to proceed with full implementation"
        elif score >= 0.6:
            return "Proceed with pilot project and address identified gaps"
        elif score >= 0.4:
            return "Significant preparation needed before implementation"
        else:
            return "Not ready for ML implementation - focus on foundational capabilities"

    def generate_implementation_roadmap(self, project_scope, timeline_months=12):
        """
        Generate phased implementation roadmap
        """
        phases = []

        # Phase 1: Foundation (Months 1-3)
        phases.append({
            'phase': 'Foundation',
            'duration_months': 3,
            'objectives': [
                'Establish data infrastructure',
                'Build ML team capabilities',
                'Define success metrics',
                'Create governance framework'
            ],
            'deliverables': [
                'Data architecture design',
                'Team training completion',
                'Success criteria document',
                'Governance policies'
            ],
            'success_criteria': [
                'Data pipeline operational',
                'Team certified in ML tools',
                'KPIs defined and approved',
                'Compliance framework established'
            ]
        })

        # Phase 2: Pilot Development (Months 4-6)
        phases.append({
            'phase': 'Pilot Development',
            'duration_months': 3,
            'objectives': [
                'Develop proof-of-concept model',
                'Test data quality and availability',
                'Validate technical approach',
                'Assess business impact'
            ],
            'deliverables': [
                'Working prototype',
                'Data quality report',
                'Technical validation results',
                'Initial ROI assessment'
            ],
            'success_criteria': [
                'Model meets accuracy requirements',
                'Data pipeline handles expected load',
                'Positive stakeholder feedback',
                'Clear path to business value'
            ]
        })

        # Phase 3: Production Implementation (Months 7-9)
        phases.append({
            'phase': 'Production Implementation',
            'duration_months': 3,
            'objectives': [
                'Deploy production-ready system',
                'Integrate with existing workflows',
                'Implement monitoring and alerts',
                'Train end users'
            ],
            'deliverables': [
                'Production system deployment',
                'Integration documentation',
                'Monitoring dashboard',
                'User training materials'
            ],
            'success_criteria': [
                'System meets SLA requirements',
                'Successful integration testing',
                'Monitoring captures key metrics',
                'Users adopted new processes'
            ]
        })

        # Phase 4: Optimization and Scaling (Months 10-12)
        phases.append({
            'phase': 'Optimization and Scaling',
            'duration_months': 3,
            'objectives': [
                'Optimize model performance',
                'Scale to additional use cases',
                'Implement continuous improvement',
                'Measure business impact'
            ],
            'deliverables': [
                'Performance optimization report',
                'Scaling implementation plan',
                'Continuous learning pipeline',
                'Business impact analysis'
            ],
            'success_criteria': [
                'Improved model accuracy/efficiency',
                'Successful scaling validation',
                'Automated model updates',
                'Documented business benefits'
            ]
        })

        return phases
```
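To see how these pieces fit together, here is a minimal usage sketch of the framework above; the readiness scores passed in are hypothetical:

```python
# Hypothetical usage of MLImplementationFramework; all input scores are illustrative
framework = MLImplementationFramework()

readiness = framework.assess_project_readiness({
    'data_quality_score': 0.7, 'data_availability': 0.8,
    'team_ml_expertise': 0.5, 'technical_infrastructure': 0.6,
    'leadership_commitment': 0.9, 'change_readiness': 0.6,
    'roi_potential': 0.7, 'problem_clarity': 0.8
})
print(f"Overall readiness: {readiness['overall_readiness']:.2f}")  # 0.70 for these inputs
print(readiness['recommendation'])  # "Proceed with pilot project and address identified gaps"

for phase in framework.generate_implementation_roadmap(project_scope='pilot'):
    print(f"{phase['phase']}: {phase['duration_months']} months")
```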
Different industries also have unique regulatory frameworks that shape how ML can be implemented. Machine learning applications span virtually every industry, offering transformative potential for organizations willing to invest in data-driven approaches. Success requires careful attention to industry-specific requirements, regulatory compliance, and organizational readiness.
The key to successful ML implementation lies in the principles outlined above: clear business objectives, strong data quality and governance, iterative development, cross-functional collaboration, and sustained attention to ethics and fairness.
As machine learning technologies continue to evolve, organizations that develop strong foundational capabilities and domain expertise will be best positioned to leverage these powerful tools for competitive advantage and societal benefit.
The examples and frameworks provided in this appendix serve as starting points for industry-specific ML implementations. Organizations should adapt these approaches based on their unique requirements, constraints, and opportunities while maintaining focus on delivering value to stakeholders and end users.

# Epilogue: The Dawn of Your ML Journey
“Every ending is a new beginning in disguise.”
As you close this book and reflect on the journey we’ve shared, take a moment to recognize the transformation you’ve undergone. You began as a curious learner, perhaps uncertain about the mathematical foundations and overwhelmed by the possibilities. You now stand as a machine learning practitioner equipped with both the technical mastery and the ethical wisdom to build intelligent systems that serve humanity.
We began with the philosophical questions that define our field: What is learning? How do machines discover patterns? Through Tom Mitchell’s formal definitions and Russell & Norvig’s AI frameworks, we built a solid theoretical foundation. We learned that data is not just numbers but stories waiting to be told, and that preprocessing is not just cleaning but the art of preparing data to reveal its secrets.
In the heart of our journey, we mastered the core algorithms that have powered the ML revolution. From the information-theoretic elegance of decision trees to the geometric beauty of support vector machines, from the statistical foundations of regression to the clustering wisdom that finds hidden communities in data, each algorithm became a tool in your growing toolkit and a lens for seeing patterns in complexity.
Here we transcended mere technique to embrace the artistry of machine learning. We learned to see through high dimensions with PCA’s mathematical eyes, to orchestrate end-to-end projects like symphonies of intelligence, and to evaluate models with the rigor of scientific inquiry. This is where you evolved from a user of algorithms to a creator of intelligent solutions.
Our final act addressed the most crucial question of our time: How do we ensure that the intelligence we create serves humanity’s highest aspirations? You’ve become not just a practitioner but a guardian, someone who understands that with algorithmic power comes ethical responsibility.
As you venture into the professional world of machine learning, you’ll encounter questions that don’t have textbook answers:
🤔 The Consciousness Frontier: What happens when AI systems begin to exhibit behaviors indistinguishable from consciousness? How will we recognize and respect non-human intelligence?
🌍 The Global Challenge: How can we ensure that AI’s transformative power reaches every corner of humanity, bridging rather than widening the digital divide?
🔮 The Singularity Question: How do we maintain human agency and purpose in a world where machines surpass human cognitive abilities in most domains?
🧬 The Enhancement Dilemma: Where do we draw the line between using AI to augment human capabilities and fundamentally altering what it means to be human?
🏛️ The Governance Puzzle: What new forms of democratic participation will emerge to govern AI systems that affect billions of lives?
You are not just a practitioner of machine learning; you are a co-author of humanity’s next chapter. The algorithms you build, the biases you eliminate, the fairness you embed, and the transparency you provide will ripple through time, affecting generations yet unborn.
Keep your eyes on the emerging frontiers that will define the next decade of ML.
Remember that you don’t walk this path alone. You’re joining a global community of researchers, practitioners, and ethicists who share your passion for building technology that elevates humanity.
Your formal education in machine learning may be complete, but your real education is just beginning. The field evolves so rapidly that today’s cutting edge becomes tomorrow’s foundation. Embrace this as a feature, not a bug: it means you’ll never stop growing, never stop discovering, never stop being amazed by what’s possible.
As you stand at the threshold of your professional journey, ask yourself: What kind of future do you want to help create? Your answer to this question will guide every algorithmic decision, every model architecture choice, and every deployment strategy you make.
The mathematical theorems live in your mind. The programming patterns flow through your fingers. The ethical frameworks guide your conscience. The statistical intuition sharpens your judgment.
But most importantly, there is the wonder and curiosity that brought you to machine learning in the first place: guard that flame carefully. It will light your way through challenges yet to come and inspire innovations yet to be imagined.
Machine learning is what the philosopher James Carse would call an “infinite game”: a game played for the purpose of continuing play, where the goal is not to end the game but to keep it going and to bring more and more people into the play.
Unlike finite games with clear winners and losers, the infinite game of machine learning invites all of humanity to participate in the grand project of understanding intelligence, augmenting human capability, and exploring the outer reaches of what’s possible when mathematics meets creativity.
The tutorials are complete. The theory is learned. The ethics are internalized. The tools are in your hands.
Now the real adventure begins.
Go forth and build systems that learn, adapt, and evolve. Create algorithms that see patterns humans miss and make connections humans couldn’t imagine. But always remember: the most sophisticated AI is only as wise as the human who builds it with love, integrity, and hope for a better future.
The future of artificial intelligence is not predetermined; it will be written by people like you, one line of code, one ethical decision, one moment of wonder at a time.
“The best time to plant a tree was 20 years ago. The second best time is now. The best time to build ethical AI was at the dawn of computing. The second best time is right now.”
Welcome to the infinite game. Your move.
To the giants whose shoulders we stand on: Tom Mitchell, Stuart Russell, Peter Norvig, and countless others who built the foundations of our field. To the open-source community that democratized access to powerful tools. To the students and practitioners who will carry this knowledge forward into uncharted territories.
And to you, dear reader, who invested your time and energy in mastering these concepts. The future of AI is brighter because you are in it.
The story continues…

# REFERENCES AND BIBLIOGRAPHY
This section provides comprehensive citations for all sources, research papers, books, and online resources referenced throughout this textbook.
[1] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New York. ISBN: 978-0387310732.
[2] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer Series in Statistics. ISBN: 978-0387848570.
[3] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. ISBN: 978-0262018029.
[4] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. ISBN: 978-1461471370.
[5] Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O’Reilly Media. ISBN: 978-1492032649.
[6] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. ISBN: 978-0262035613.
[7] Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson. ISBN: 978-0134610993.
[8] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill Education. ISBN: 978-0070428072.
[9] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[10] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097-1105.
[11] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[12] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
[14] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.
[15] Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572.
[16] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. International Joint Conference on Artificial Intelligence, 14(2), 1137-1145.
[17] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
[18] Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
[19] Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
[20] Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
[21] Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer Series in Statistics. ISBN: 978-0387954424.
[22] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
[23] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 4765-4774.
[24] Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[25] McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 51-56.
[26] Harris, C. R., et al. (2020). Array programming with NumPy. Nature, 585(7825), 357-362.
[27] Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95.
[28] Waskom, M. L. (2021). seaborn: statistical data visualization. Journal of Open Source Software, 6(60), 3021.
[29] Scikit-learn Documentation. Retrieved from https://scikit-learn.org/stable/
[30] Python Software Foundation. Python Language Reference, version 3.8+. Available at https://docs.python.org/3/
[31] NumPy Documentation. Retrieved from https://numpy.org/doc/stable/
[32] Pandas Documentation. Retrieved from https://pandas.pydata.org/docs/
[33] Matplotlib Documentation. Retrieved from https://matplotlib.org/stable/contents.html
[34] Maharashtra State Board of Technical Education (MSBTE). (2023). Curriculum for Computer Technology - Course Code 316316: Machine Learning. Mumbai: MSBTE Publications.
[35] All India Council for Technical Education (AICTE). (2022). Model Curriculum for Computer Science and Engineering. New Delhi: AICTE.
[36] ACM/IEEE-CS Joint Task Force on Computing Curricula. (2013). Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science. ACM/IEEE.
[37] Bloom, B. S. (1956). Taxonomy of Educational Objectives: The Classification of Educational Goals. Longmans, Green.
[38] Anderson, L. W., & Krathwohl, D. R. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Allyn & Bacon.
[39] Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.
[40] Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115-118.
[41] Rajkomar, A., et al. (2018). Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1), 18.
[42] Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning. fairmlbook.org.
[43] O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Books.
[44] Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389-399.
[45] Fisher, R. A. (1936). Iris flower dataset. UC Irvine Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/iris
[46] Wolberg, W. H., Street, W. N., & Mangasarian, O. L. (1995). Breast Cancer Wisconsin (Diagnostic) dataset. UC Irvine Machine Learning Repository.
[47] Forina, M., et al. (1991). Wine recognition dataset. UC Irvine Machine Learning Repository.
[48] Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102. (Source of the Boston Housing dataset.)
[49] Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
[50] Kaggle Inc. (2023). Kaggle Datasets. Retrieved from https://www.kaggle.com/datasets
[51] Google Research. (2023). Google Dataset Search. Retrieved from https://datasetsearch.research.google.com/
[52] OpenML Foundation. (2023). OpenML: An open science platform for machine learning. Retrieved from https://www.openml.org/
[53] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433-460.
[54] McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115-133.
[55] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
[56] McCarthy, J., et al. (1955). A proposal for the Dartmouth summer research project on artificial intelligence. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
[57] Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370-418.
[58] Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
[59] Brown, T., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.
[60] Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[61] Dosovitskiy, A., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[62] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD, 1135-1144.
[63] Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. International Conference on Machine Learning, 3145-3153.
[64] Coursera Inc. Machine Learning Course by Andrew Ng. Retrieved from https://www.coursera.org/learn/machine-learning
[65] edX Inc. MIT Introduction to Machine Learning. Retrieved from https://www.edx.org/course/introduction-to-machine-learning
[66] Khan Academy. Statistics and Probability. Retrieved from https://www.khanacademy.org/math/statistics-probability
[67] Towards Data Science. Medium Publication. Retrieved from https://towardsdatascience.com/
[68] Machine Learning Mastery. Jason Brownlee. Retrieved from https://machinelearningmastery.com/
[69] Distill.pub. Visual Explanations of Machine Learning. Retrieved from https://distill.pub/
[70] IEEE Standards Association. (2021). IEEE Standard for Artificial Intelligence (AI) - Model Process. IEEE Std 2857-2021.
[71] Association for Computing Machinery (ACM). (2018). ACM Code of Ethics and Professional Conduct. Retrieved from https://www.acm.org/code-of-ethics
[72] International Organization for Standardization (ISO). (2023). ISO/IEC 23053:2022 - Framework for AI systems using ML. Geneva: ISO.
[73] Partnership on AI. (2023). AI Tenets. Retrieved from https://www.partnershiponai.org/
[74] Jupyter Project. (2023). Project Jupyter. Retrieved from https://jupyter.org/
[75] Google Colaboratory. (2023). Retrieved from https://colab.research.google.com/
[76] Anaconda Inc. (2023). Anaconda Distribution. Retrieved from https://www.anaconda.com/
[77] GitHub Inc. (2023). GitHub Platform. Retrieved from https://github.com/
[78] Git Software. (2023). Git Version Control. Retrieved from https://git-scm.com/
This bibliography follows a hybrid citation style combining elements from:
- APA (American Psychological Association) for academic papers and books
- IEEE for technical standards and conference proceedings
- Chicago Manual of Style for historical references
All web resources include access dates where retrieval dates are relevant to content currency.
The authors gratefully acknowledge all researchers, educators, and practitioners whose work has contributed to the field of machine learning and made this textbook possible. Special recognition goes to the open-source community for creating tools that democratize access to machine learning education and research.
Every effort has been made to accurately cite all sources and provide appropriate attribution. Any omissions or errors in citation are unintentional, and corrections will be made in future editions.
APA Format:
Chatake, A. (2025). Machine learning: A comprehensive guide to artificial intelligence
and data science. Chatake Innoworks Publications.
IEEE Format:
A. Chatake, "Machine Learning: A Comprehensive Guide to Artificial Intelligence
and Data Science," Chatake Innoworks Publications, 2025.
Chicago Format:
Chatake, Akash. Machine Learning: A Comprehensive Guide to Artificial Intelligence
and Data Science. India: Chatake Innoworks Publications, 2025.
For updates to this bibliography or to suggest additional references, please contact the author through the official channels listed in the front matter.

# INDEX
Comprehensive alphabetical reference for all topics, concepts, algorithms, and terms covered in this textbook.
Accuracy, 45, 78, 156, 189, 245
- Cross-validation accuracy, 201-205
- Training vs. testing accuracy, 167-169
Activation Functions, 89, 156, 234
- ReLU, 234
- Sigmoid, 156-159, 234
- Tanh, 234
Algorithms
- Classification algorithms, 45-78
- Comparison of algorithms, 189-195
- Selection guidelines, 195-197
Artificial Intelligence (AI), 1-15, 301-315
- Definition and scope, 1-3
- History of AI, 305-308
- Narrow vs. General AI, 302-303
- Types of AI, 2-3
Artificial Neural Networks, 234-240
- Backpropagation, 237-238
- Forward propagation, 236-237
- Multi-layer perceptrons, 235-236
AUC (Area Under Curve), 178-182
Bagging, 67-69
- Bootstrap aggregating, 67
- Random Forest, 67-70
Bias, 34, 167, 234
- Bias-variance tradeoff, 167-168
- In machine learning models, 34-35
Binary Classification, 45-47, 156-159
Bootstrap Sampling, 67-68
Classification, 45-197
- Binary classification, 45-47
- Multi-class classification, 47-48, 162-164
- Multi-label classification, 48
- Evaluation metrics, 172-189
Classification Algorithms
- Decision Trees, 49-70
- K-Nearest Neighbors, 71-98
- Logistic Regression, 156-170
- Random Forest, 67-70
- Support Vector Machines, 99-155
Clustering, 245-267
- Hierarchical clustering, 256-259
- K-Means clustering, 245-255
- DBSCAN, 259-262
Confusion Matrix, 172-175
Cost Function, 156-159, 234
Cross-Validation, 26-30, 183-186
- K-fold cross-validation, 27-29
- Leave-one-out cross-validation, 29
- Stratified cross-validation, 29-30
Curse of Dimensionality, 98, 117-119
Data Preprocessing, 16-44
- Data cleaning, 17-22
- Feature scaling, 31-35
- Handling missing values, 22-26
- Train-test split, 26-30
Data Quality, 17-22
- Outlier detection, 19-21
- Data validation, 21-22
Datasets
- Boston Housing, 198, 220-225
- Breast Cancer, 52-56, 99-105
- Iris, 1, 52-56, 71-75, 172-189
- Wine, 75-78, 195-197
Decision Trees, 49-70
- Entropy, 53-55
- Gini impurity, 55-56
- Information gain, 53-55
- Overfitting and pruning, 56-61
- Visualization, 52-53
Deep Learning, 234-240, 308-310
- Convolutional Neural Networks, 238-240
- Neural networks, 234-238
Dimensionality Reduction, 117-134, 267-289
- Linear Discriminant Analysis (LDA), 128-134
- Principal Component Analysis (PCA), 117-128
- t-SNE, 278-282
Ensemble Methods, 67-70
- Bagging, 67-69
- Boosting, 69-70
- Random Forest, 67-70
Entropy, 53-55
Evaluation Metrics
- Classification metrics, 172-189
- Regression metrics, 225-232
Exploratory Data Analysis (EDA), 17-19
F1-Score, 176-178
Feature Engineering, 99-155
- Feature creation, 134-140
- Feature extraction, 117-134
- Feature scaling, 103-109
- Feature selection, 109-117
Feature Selection, 109-117
- Embedded methods, 115-117
- Filter methods, 109-112
- Wrapper methods, 112-115
Gradient Descent, 159-162, 234
Grid Search, 78-81, 152-155
Hyperparameter Tuning, 78-81, 152-155
Hyperparameters
- Decision Trees, 61-64
- KNN, 81-85
- SVM, 152-155
Imputation, 22-26
- Mean/median imputation, 23-24
- Forward/backward fill, 24-25
- Advanced imputation, 25-26
Information Gain, 53-55
K-Fold Cross-Validation, 27-29
K-Means Clustering, 245-255
K-Nearest Neighbors (KNN), 71-98
- Distance metrics, 75-78
- Implementation from scratch, 85-90
- Optimal K selection, 81-85
- Weighted KNN, 90-92
Kernel Trick, 140-148
- Linear kernel, 141
- Polynomial kernel, 142-144
- RBF kernel, 144-146
- Sigmoid kernel, 146-148
Learning Curves, 186-189
Linear Discriminant Analysis (LDA), 128-134
Linear Regression, 198-215
- Multiple linear regression, 205-210
- Simple linear regression, 198-205
Logistic Regression, 156-170
- Cost function, 156-159
- Multi-class logistic regression, 162-164
- Regularization, 164-167
- Sigmoid function, 156-159
Machine Learning
- Definition, 1-3
- Supervised vs. unsupervised, 3-6
- Types of learning, 3-6
Mean Absolute Error (MAE), 225-228
Mean Squared Error (MSE), 225-228
Missing Values, 22-26
Model Evaluation, 172-189, 225-232
Multi-class Classification, 47-48, 162-164
Mutual Information, 134-137
Neural Networks, 234-240
Normalization, 31-35
- Min-Max normalization, 32-33
- Z-score normalization, 33-34
One-Hot Encoding, 35-38
Outliers, 19-21, 34-35
Overfitting, 56-61, 167-169
- In decision trees, 56-61
- Prevention techniques, 61-64
Precision, 176-178
Principal Component Analysis (PCA), 117-128
- Eigenvalues and eigenvectors, 119-122
- Implementation, 122-125
- Interpretation, 125-128
Pruning, 61-64
Python Libraries
- Matplotlib, 8-12
- NumPy, 6-8
- Pandas, 8-10
- Scikit-learn, 12-15
R-squared, 228-232
Random Forest, 67-70
Recall, 176-178
Regression, 198-243
- Linear regression, 198-215
- Polynomial regression, 215-220
- Ridge regression, 220-225
Regularization, 164-167, 220-225
- L1 regularization (Lasso), 222-225
- L2 regularization (Ridge), 220-222
ROC Curve, 178-182
Scaling, 31-35, 103-109
- Min-Max scaling, 104-106
- Robust scaling, 108-109
- Standard scaling, 106-108
SHAP Values, 137-140
Sigmoid Function, 156-159
Supervised Learning, 3-5, 45-243
Support Vector Machines (SVM), 99-155
- Kernel trick, 140-148
- Linear SVM, 99-105
- Non-linear SVM, 140-148
- Parameter tuning, 152-155
Train-Test Split, 26-30
Underfitting, 167-169
Unsupervised Learning, 5-6, 245-289
Validation Curves, 64-67, 152-155
Variance, 167-169
Weighted KNN, 90-92
80-20 Rule, 26-27
Appendix A: Mathematical Prerequisites, 290-295
Appendix B: Python Setup Guide, 296-298
Appendix C: Dataset Sources, 299-300
| Algorithm | Page | Chapter |
|---|---|---|
| Decision Trees | 49-70 | 4 |
| K-Nearest Neighbors | 71-98 | 4 |
| Support Vector Machines | 99-155 | 4 |
| Logistic Regression | 156-170 | 4 |
| Linear Regression | 198-215 | 5 |
| Ridge Regression | 220-225 | 5 |
| Lasso Regression | 222-225 | 5 |
| Random Forest | 67-70 | 4 |
| K-Means Clustering | 245-255 | 6 |
| PCA | 117-128 | 3 |
| LDA | 128-134 | 3 |
| Concept | Page | Description |
|---|---|---|
| Entropy | 53-55 | Measure of information content |
| Information Gain | 53-55 | Reduction in entropy |
| Euclidean Distance | 75-76 | Standard distance metric |
| Manhattan Distance | 76-77 | L1 distance metric |
| Cosine Similarity | 77-78 | Angle-based similarity |
| Eigenvalues | 119-122 | PCA mathematical foundation |
| Gradient Descent | 159-162 | Optimization algorithm |
| Cross-Entropy | 156-159 | Logistic regression loss |
Page numbers are approximate and may vary in the final printed version.

# BACK COVER
🎯 Master the Future of Technology with This Complete Educational Resource
In an era where artificial intelligence reshapes every industry, machine learning has become the most critical skill for technology professionals. This comprehensive textbook bridges the gap between theoretical computer science and practical, industry-relevant AI applications.
✅ Solid Foundations: Mathematical principles underlying ML algorithms
✅ Hands-On Skills: 100+ Python code examples with real datasets
✅ Industry Applications: Real-world case studies and projects
✅ Ethical AI: Responsible development practices and considerations
✅ Complete Coverage: From data preprocessing to advanced algorithms
Chapter 1: Introduction to Machine Learning
Chapter 2: Data Preprocessing and Validation
Chapter 3: Feature Engineering Mastery
Chapter 4: Classification Algorithms
Chapter 5: Regression Techniques
Plus: Advanced topics, projects, and assessments
✅ MSBTE-Aligned: Complete curriculum coverage for Course 316316
✅ Progressive Learning: Carefully scaffolded difficulty progression
✅ Theory + Practice: Mathematical rigor with hands-on implementation
✅ Industry-Relevant: Skills used in professional ML teams
✅ Ethical Focus: Responsible AI development integrated throughout
✅ Open Source: All tools and datasets freely available
“Finally, a textbook that explains ML concepts clearly without sacrificing depth. The code examples are invaluable!”
– Computer Engineering Student

“Perfect balance of theory and practice. I could implement everything I learned immediately.”
– Data Science Aspirant

“The best ML textbook for Indian engineering curriculum. Highly recommended!”
– Faculty Member, Technical Institute
After completing this book, you will be able to work confidently with:

Languages: Python 3.8+
Libraries: Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn
Tools: Jupyter Notebooks, Git, GitHub
Algorithms: Decision Trees, SVM, KNN, Logistic Regression, Random Forest
Techniques: Cross-validation, Feature selection, Hyperparameter tuning
This textbook has been developed with input from:
- Industry ML Practitioners from leading tech companies
- Academic Experts in computer science education
- MSBTE Curriculum Specialists ensuring standards compliance
- Student Beta Testers from multiple institutions
Companion Website: Complete code repository, datasets, and updates
Online Community: Join thousands of learners in our study groups
Instructor Resources: Slides, solution manuals, and assessment tools
Video Supplements: Key concepts explained through visual demonstrations
Akash Chatake is the founder of Chatake Innoworks and a passionate technology educator with extensive experience in AI/ML education. He has guided hundreds of students from beginners to ML practitioners and continues to contribute to making advanced technology education accessible to all.
Transform your understanding of artificial intelligence and machine learning. Whether you’re a student preparing for examinations, a professional seeking career advancement, or an educator looking for comprehensive course material, this book provides everything you need to master machine learning.
Visit: [Website URL]
Email: [Contact Email]
Community: [Learning Community Link]
ISBN: 978-X-XXXX-XXXX-X
Publisher: Chatake Innoworks Publications
Category: Computer Science / Machine Learning
Level: Intermediate to Advanced
Edition: First Edition, 2025
🏷️ Price: ₹XXX (Print) | ₹XXX (Digital)
🌍 Available: India and International
📱 Formats: Hardcover, Paperback, PDF, EPUB
Start your machine learning journey today!